Abstract

We consider the problem of adaptive stratified sampling for Monte Carlo integration of a noisy function, given a finite budget n of noisy evaluations of the function. In this paper we tackle the problem of adapting to the function both the number of samples allocated to each stratum and the partition itself. More precisely, it is worthwhile to refine the partition of the domain in areas where the noise of the function, or the variations of the function, are very heterogeneous. On the other hand, an overly refined stratification is not optimal: the more refined the stratification, the more difficult it is to adjust the allocation of the samples to the stratification, i.e. to sample more points where the noise or the variations of the function are larger. We provide an algorithm that selects online, among a large class of partitions, the partition that provides the optimal trade-off, and allocates the samples almost optimally on this partition.

1. Introduction 

The objective of this paper is to provide an efficient strategy for integrating a noisy function $F$. The learner can sample the function $n$ times. If it samples the function at time $t$ at a point $x_t$ of the domain $\mathcal X$, which it can choose at its convenience, it obtains the noisy sample $F(x_t, \epsilon_t)$, where $\epsilon_t$ is drawn independently at random from some distribution $\mathcal L_{x_t}$, and $\mathcal L_x$ is a probability distribution that depends on $x$.

If the variations of the function $F$ are known to the learner, an efficient strategy is to sample more points in parts of the domain $\mathcal X$ where the variations of $F$ are larger. This intuition is explained more formally in the setting of Stratified Sampling (see e.g. (Rubinstein and Kroese, 2008)).

More precisely, assume that the domain $\mathcal X$ is divided into $K_{\mathcal N}$ regions (according to the usual terminology of stratified sampling, we refer to these regions as strata) that form a partition $\mathcal N$ of $\mathcal X$. It is optimal (for an oracle) to allocate to each stratum a number of points proportional to the measure of the stratum times a quantity depending on the variations of $F$ in the stratum (see Subsection 5.5 of (Rubinstein and Kroese, 2008)). We refer to this strategy as the optimal oracle strategy for partition $\mathcal N$.

The problem is that the variations of the function $F$ in each stratum of $\mathcal N$ are unknown to the learner. In the papers (Etore and Jourdain, 2010; Grover, 2009; Carpentier and Munos, 2011a), the authors address the problem of simultaneously estimating the variations of $F$ in each stratum and allocating the samples optimally among the strata according to these estimates. Up to some variation in efficiency or assumptions, these papers provide learners that are indeed able to learn about the variations of the function and allocate the samples optimally in the strata, up to a negligible term. However, all these papers make explicit, in the theoretical bounds or at least intuitively, the existence of a natural trade-off in terms of the refinement of the partition. The more refined the partition (especially if it gets more refined where the variations of $F$ are larger), the smaller the variance of the estimate output by the optimal oracle strategy. But the more refined the partition, the larger the error of an adaptive strategy with respect to this optimal oracle strategy: the more strata there are, the harder it is to adapt to each stratum.

It is thus important to also adapt the partition to the function, refining the strata more where the variations of the function $F$ are larger, while at the same time limiting the number of strata. As a matter of fact, a good partition of the domain is such that, inside each stratum, the values taken by $F$ are as homogeneous as possible (see Subsection 5.5 of (Rubinstein and Kroese, 2008)), while at the same time the number of strata is not too large.

There are some recent papers on how to stratify 
efficiently the space, e.g. (Glasserman et al., 1999; 
Kawai, 2010; Etore et al., 2011; Carpentier and Munos, 
2012a;b). More specifically, in the recent paper (Etore 
et al., 2011), the authors propose an algorithm for 
performing this task online and efficiently. They do 
not provide proofs of convergence for their algorithm, 
but they give some properties of optimal stratified 
estimate when the number of strata goes to infinity, 
notably convergence results under the optimal alloca- 
tion. They also give some intuitions on how to split 
efficiently the strata. Having an asymptotic vision 
of this problem prevents them however from giving 
clear directions on how exactly to adapt the strata, as 
well as from providing theoretical guarantees. In pa- 
per (Carpentier and Munos, 2012a), the authors pro- 
pose to stratify the domain according to some pre- 
liminary knowledge on the class of smoothness of the 
function. They however fix the partition before sam- 
pling and thus do not consider online adaptation of the 
partition to the function. Finally, although it considers online adaptation of the partition to the function, the paper (Carpentier and Munos, 2012b) addresses the specific and rather different setting (see footnote 1) where the noise $\epsilon$ of the function $F$ is null, and where $F$ is differentiable with respect to $x$.

Contributions: We consider in this paper the problem of designing a partition of the space efficiently and adaptively to the function, and of allocating the samples efficiently on this partition. More precisely, our aim is to build an algorithm that allocates the samples almost as an oracle would on the best possible partition (adaptive to the function $F$, i.e. one that solves the trade-off described above) in a large class of partitions. The class of partitions we consider is the set of partitions defined by a hierarchical partitioning of the domain (as, for instance, what was considered in (?) for function optimization).

• We provide ideas, new to the best of our knowledge, for sampling a domain very homogeneously, i.e. such that the samples are well scattered. The sampling schemes we introduce share ideas with low discrepancy schemes (see e.g. (Niederreiter, 2010)), and we provide some theoretical guarantees for their efficiency.

• We provide an algorithm, called Monte-Carlo Upper Lower Confidence Bound (MC-ULCB). We prove that it manages to simultaneously select an optimal partition of the hierarchical partitioning and allocate the samples in this partition almost as an oracle would do. More precisely, we prove that its pseudo-risk is smaller, up to a constant, than the pseudo-risk of MC-UCB on any partition of the hierarchical partitioning.

1 In this setting where the function $F$ is noiseless and very regular, efficient strategies share ideas with quasi Monte-Carlo strategies, and the number of strata should be almost equal to the budget $n$.

The rest of the paper is organised as follows. In Section 2 we formalise the problem and introduce the notation used throughout the paper. We also recall the problem-independent bound for algorithm MC-UCB. Section 3 presents algorithm MC-ULCB and its bound on the pseudo-risk. After a technical part on notation, we introduce what we call the Balanced Sampling Scheme (BSS) and a variant of it, BSS-A. These are sampling schemes for allocating samples on a domain in a random yet almost low discrepancy way. Algorithm MC-ULCB, which we present afterwards, relies heavily on them. We also discuss the results, and finally conclude the paper.

2. Preliminaries 
2.1. The function 

Consider a noisy function $F : (x, \epsilon) \in \mathcal X \times \Omega \to \mathbb R$. In this definition, $\mathcal X$ is the domain on which the learner can choose at which point $x$ to sample, and $\Omega$ is a space on which the noise $\epsilon$ of the function is defined. We define for any $x \in \mathcal X$ the distribution of the noise $\epsilon$ conditional on $x$ as $\mathcal L_x$. We also define a finite measure $\nu$ on $\mathcal X$, associated with a $\sigma$-algebra whose sets are subsets of $\mathcal X$. Without loss of generality, we assume that $\nu(\mathcal X) = 1$ ($\nu$ is a probability measure).

The objective of the learner is to sample the domain $\mathcal X$ in order to build an efficient estimate of the integral of the noisy function $F$ according to the measure $(\nu, \mathcal L_x)$, that is to say $\int_{\mathcal X} \mathbb E_{\epsilon \sim \mathcal L_x} F(x, \epsilon)\, d\nu(x)$. The learner can sample the function sequentially $n$ times, and observe noisy samples. When sampling the function at time $t$ in $x_t$, it observes a noisy sample $F(x_t, \epsilon_t)$. The noise $\epsilon_t \sim \mathcal L_{x_t}$ conditional on $x_t$ is independent of the previous samples $(x_i, \epsilon_i)_{i<t}$.

For any point $x \in \mathcal X$, define

$$g(x) = \mathbb E_{\epsilon \sim \mathcal L_x} F(x, \epsilon) \quad \text{and} \quad s(x) = \sqrt{\mathbb E_{\epsilon \sim \mathcal L_x}\big[(F(x, \epsilon) - g(x))^2\big]}.$$

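We state the following Assumption on the function.

Assumption 1 We assume that both $g$ and $s$ are bounded in absolute value by a constant $f_{\max}$. Let $v(x, \epsilon) = \frac{F(x, \epsilon) - g(x)}{s(x)}$ (if $s(x) = 0$, set $v(x, \epsilon) = 0$). We assume that $\exists b$ such that $\forall \lambda < \frac{1}{b}$,

$$\mathbb E_{\epsilon \sim \mathcal L_x}\big[\exp(\lambda v(x, \epsilon))\big] \le \exp\Big(\frac{\lambda^2}{2(1 - \lambda b)}\Big), \quad \text{and} \quad \mathbb E_{\epsilon \sim \mathcal L_x}\big[\exp(\lambda v(x, \epsilon)^2 - \lambda)\big] \le \exp\Big(\frac{\lambda^2}{2(1 - \lambda b)}\Big).$$
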

Assumption 1 means that the variations coming from the noise in $F$, although potentially unbounded, are not too large (see footnote 2). We believe that it is rather general. In particular, it is satisfied if $F$ is bounded, or also for e.g. a bounded function perturbed by an additive, heteroscedastic, (sub-)Gaussian noise.

2.2. Notations for a hierarchical partitioning 

The strategies that we are going to consider for integration are allowed to choose where to sample the domain. In order to do that, the strategies we consider will partition the domain $\mathcal X$ into strata and sample randomly in the strata. In theory the stratification is at the discretion of the strategy and can be arbitrary. In practice, however, we will consider strategies that rely on a given hierarchical partitioning.

Define a dyadic hierarchical partitioning $\mathcal T$ of the domain $\mathcal X$. More precisely, we consider a set of partitions of $\mathcal X$ at every depth $h \ge 0$: for any integer $h$, $\mathcal X$ is partitioned into a set of $2^h$ strata $\mathcal X_{[h,i]}$, where $0 \le i \le 2^h - 1$. This partitioning can be represented by a dyadic tree structure, where each stratum $\mathcal X_{[h,i]}$ corresponds to a node $[h,i]$ of the tree (indexed by its depth $h$ and index $i$). Each node $[h,i]$ has 2 children nodes $[h+1, 2i]$ and $[h+1, 2i+1]$. In addition, the strata of the children form a sub-partition of the parent's stratum $\mathcal X_{[h,i]}$. The root of the tree corresponds to the whole domain $\mathcal X$.
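The indexing above can be sketched in a few lines of Python. This is only an illustration under the assumption that $\mathcal X = [0, 1)$, that the strata are dyadic intervals, and that the measure is the Lebesgue measure; the function names `stratum`, `children` and `measure` are hypothetical and do not come from the paper.

```python
from fractions import Fraction


def stratum(h, i):
    """Return the interval [a, b) of [0, 1) attached to node [h, i]."""
    if h < 0 or not 0 <= i <= 2 ** h - 1:
        raise ValueError("need h >= 0 and 0 <= i <= 2^h - 1")
    width = Fraction(1, 2 ** h)
    return i * width, (i + 1) * width


def children(h, i):
    """Children nodes [h+1, 2i] and [h+1, 2i+1] of node [h, i]."""
    return (h + 1, 2 * i), (h + 1, 2 * i + 1)


def measure(h, i):
    """Lebesgue measure of the stratum of node [h, i] (assumed setting)."""
    a, b = stratum(h, i)
    return b - a


# The two children of node [2, 1] split its interval [1/4, 1/2) in two halves.
left, right = children(2, 1)
print(stratum(2, 1), stratum(*left), stratum(*right))
```

Exact fractions are used so that the sub-partition property can be checked without floating-point rounding.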

We state the following assumption on the measurabil- 
ity and on the measure of any stratum of the hierar- 
chical partitioning. 

Assumption 2 $\forall [h,i] \in \mathcal T$, the stratum $\mathcal X_{[h,i]}$ is measurable according to the $\sigma$-algebra on which the probability measure $\nu$ is defined.

We write $w_{[h,i]}$ the measure of stratum $\mathcal X_{[h,i]}$, i.e. $w_{[h,i]} = \nu(\mathcal X_{[h,i]})$. We also assume that the hierarchical partitioning is such that all the strata of a given depth have the same measure, i.e. $w_{[h,i]} = w_h$.

Assumption 3 $\forall [h,i] \in \mathcal T$, the children strata of $[h,i]$ are such that $w_{h+1} = \nu(\mathcal X_{[h+1,2i]}) = \nu(\mathcal X_{[h+1,2i+1]}) = \frac{w_h}{2}$.

2 This assumption implies that the variations induced by the noise are sub-Gaussian. It is actually slightly stronger than the usual sub-Gaussian assumption. Nevertheless, e.g. bounded random variables and Gaussian random variables satisfy it.



If for example X = [0,1], a hierarchical partition- 
ing that satisfies the previous assumptions with the 
Lebesgue measure is illustrated in Figure 1. 




Figure 1. Example of hierarchical partitioning in dimension 1.

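We call mean and variance of stratum $\mathcal X_{[h,i]}$ the mean and variance of a sample of the function $F$ collected at the point $X$, where $X$ is drawn at random according to $\nu$ conditioned on stratum $\mathcal X_{[h,i]}$. We write

$$\mu_{[h,i]} = \mathbb E_{X \sim \nu_{\mathcal X_{[h,i]}}}\big[\mathbb E_{\epsilon \sim \mathcal L_X}[F(X, \epsilon)]\big] = \frac{1}{w_h} \int_{\mathcal X_{[h,i]}} g(x)\, d\nu(x)$$

the mean and

$$\sigma^2_{[h,i]} = \frac{1}{w_h} \int_{\mathcal X_{[h,i]}} \big(g(x) - \mu_{[h,i]}\big)^2 d\nu(x) + \frac{1}{w_h} \int_{\mathcal X_{[h,i]}} s^2(x)\, d\nu(x)$$

the variance (we recall that $g$ and $s$ are defined just before Assumption 1).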

2.3. Pseudo-performance of an algorithm and 
optimal static strategies 

We denote by $\mathcal A$ an algorithm that allocates the budget $n$ and returns a partition $\mathcal N_n = (\mathcal X_{[h,i]})_{[h,i] \in \mathcal N_n}$ included in the hierarchical partitioning $\mathcal T$ of the domain. In each node $[h,i]$ of $\mathcal N_n$, algorithm $\mathcal A$ allocates uniformly $T_{[h,i],n}$ random samples. We write $(X_{[h,i],t})_{t \le T_{[h,i],n}}$ these samples, and we write $\hat\mu_{[h,i],n} = \frac{1}{T_{[h,i],n}} \sum_{t=1}^{T_{[h,i],n}} X_{[h,i],t}$ the empirical mean built with these samples. We estimate the integral of $F$ on $\mathcal X$ by $\hat\mu_n = \sum_{[h,i] \in \mathcal N_n} w_h \hat\mu_{[h,i],n}$. This is the estimate returned by the algorithm.
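As an illustration of the estimate $\hat\mu_n$, here is a minimal sketch of a stratified Monte Carlo estimate for a fixed partition and fixed sample counts. It assumes $\mathcal X = [0, 1)$ with the Lebesgue measure and plain independent uniform sampling inside each stratum; the test function `noisy_square` (with $g(x) = x^2$ and $s(x) = 0.1$) and all names are hypothetical and chosen only for the example.

```python
import random


def stratified_estimate(F, strata, counts, rng):
    """Stratified Monte Carlo estimate on [0, 1) with the Lebesgue measure.

    strata: list of (a, b) intervals forming a partition of [0, 1).
    counts: number of samples drawn uniformly in each stratum (all >= 1).
    """
    estimate = 0.0
    for (a, b), T in zip(strata, counts):
        samples = [F(rng.uniform(a, b), rng) for _ in range(T)]
        # weight of the stratum times the empirical mean in the stratum
        estimate += (b - a) * sum(samples) / T
    return estimate


def noisy_square(x, rng):
    """Hypothetical noisy function: g(x) = x^2 plus Gaussian noise, s(x) = 0.1."""
    return x * x + 0.1 * rng.gauss(0.0, 1.0)


rng = random.Random(0)
strata = [(k / 8, (k + 1) / 8) for k in range(8)]   # depth-3 partition
mu_hat = stratified_estimate(noisy_square, strata, [500] * 8, rng)
print(mu_hat)  # close to the true integral 1/3
```

The allocation here is uniform across strata; the rest of the section is about choosing the counts and the partition better than that.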

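If $\mathcal N_n$ is fixed as well as the number $T_{[h,i],n}$ of samples in each stratum, and if the $T_{[h,i],n}$ samples are independent and chosen uniformly according to the measure $\nu$ restricted to each stratum $\mathcal X_{[h,i]}$, we have

$$\mathbb E(\hat\mu_n) = \sum_{[h,i] \in \mathcal N_n} w_h \mu_{[h,i]} = \sum_{[h,i] \in \mathcal N_n} \int_{\mathcal X_{[h,i]}} g(u)\, d\nu(u) = \int_{\mathcal X} g(u)\, d\nu(u),$$

and also

$$\mathbb V(\hat\mu_n) = \sum_{[h,i] \in \mathcal N_n} w_h^2\, \mathbb E\big(\hat\mu_{[h,i],n} - \mu_{[h,i]}\big)^2 = \sum_{[h,i] \in \mathcal N_n} \frac{w_h^2 \sigma^2_{[h,i]}}{T_{[h,i],n}},$$

where the expectations and variance are computed with respect to the samples collected in the strata.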
For a given algorithm $\mathcal A$, we denote by pseudo-risk the quantity

$$L_n(\mathcal A) = \sum_{[h,i] \in \mathcal N_n} \frac{w_h^2 \sigma^2_{[h,i]}}{T_{[h,i],n}}. \tag{1}$$




This measure of performance is discussed in more depth in the papers (Grover, 2009; Carpentier and Munos, 2011b). In particular, paper (Carpentier and Munos, 2011b) links it with the mean squared error.

Note that if, for a given partition $\mathcal N$, an algorithm $\mathcal A^*_{\mathcal N}$ had access to the variances $\sigma^2_{[h,i]}$ of the strata in $\mathcal N$, it could allocate the budget in order to minimise the pseudo-risk, by choosing to pick in each stratum $\mathcal X_{[h,i]}$ (up to rounding issues) $T^*_{[h,i]} = \frac{w_h \sigma_{[h,i]}}{\Sigma_{\mathcal N}} n$ samples. The pseudo-risk for this oracle strategy is then

$$L_n(\mathcal A^*_{\mathcal N}) = \frac{\big(\sum_{[h,i] \in \mathcal N} w_h \sigma_{[h,i]}\big)^2}{n} = \frac{\Sigma_{\mathcal N}^2}{n}, \tag{2}$$

where we write $\Sigma_{\mathcal N} = \sum_{x \in \mathcal N} w_x \sigma_x$. We also refer, in the sequel, as optimal allocation (for a partition $\mathcal N$) to $\lambda_{[h,i],\mathcal N} = \frac{w_h \sigma_{[h,i]}}{\Sigma_{\mathcal N}}$. Even when the optimal allocation is not realizable because of rounding issues, it can still be used as a benchmark, since the quantity $L_n(\mathcal A^*_{\mathcal N})$ is a lower bound on the variance of the estimate output by any oracle strategy.
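The oracle allocation and its pseudo-risk can be checked numerically. The sketch below is an illustration only: the weights and standard deviations are made-up values, rounding of the sample counts is ignored, and the function names are hypothetical.

```python
def pseudo_risk(weights, sigmas, counts):
    """Pseudo-risk of Equation (1): sum of w^2 sigma^2 / T over the strata."""
    return sum((w * s) ** 2 / T for w, s, T in zip(weights, sigmas, counts))


def oracle_allocation(weights, sigmas, n):
    """Oracle counts T* = n * w * sigma / Sigma (real-valued, no rounding)."""
    Sigma = sum(w * s for w, s in zip(weights, sigmas))
    return [n * w * s / Sigma for w, s in zip(weights, sigmas)]


# Made-up example: three strata with all standard deviations positive.
weights = [0.5, 0.25, 0.25]
sigmas = [1.0, 2.0, 4.0]
n = 1000

T_star = oracle_allocation(weights, sigmas, n)          # [250, 250, 500]
risk_oracle = pseudo_risk(weights, sigmas, T_star)      # Sigma^2 / n
# Allocation proportional to the measure of the strata only.
T_uniform = [n * w for w in weights]
risk_uniform = pseudo_risk(weights, sigmas, T_uniform)
print(T_star, risk_oracle, risk_uniform)
```

In this example $\Sigma_{\mathcal N} = 2$, so the oracle pseudo-risk is $4/1000$, while the allocation proportional to the measures gives a strictly larger value.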



2.4. Main result for algorithm MC-UCB and 
point of comparison 

Let us consider a fixed partition $\mathcal N$ of the domain, and write $K_{\mathcal N}$ for the number of strata it contains. We first recall (and slightly adapt) one of the main results of paper (Carpentier and Munos, 2011b) (Theorem 2). It provides a result on the pseudo-risk of an algorithm called MC-UCB. This algorithm takes as input some parameters linked to upper bounds on the variability of the function (see footnote 3), a small probability $\delta$, and the partition $\mathcal N$. MC-UCB builds, for each stratum in the fixed partition $\mathcal N$ (see footnote 4), an upper confidence bound (UCB) on its standard deviation, and allocates the samples proportionally to the measure of each stratum times this UCB. Its pseudo-risk is bounded with high probability by the quantity stated in Theorem 1 below. This theorem also holds in our setting: the fact that the measure $\nu$ is finite, together with Assumptions 2 and 1, implies that the distribution of the samples obtained by sampling in the strata is sub-Gaussian (as a bounded mixture of sub-Gaussian random variables). We recall and slightly improve this theorem.



3 It is needed that the function is bounded and that the 
noise to the function is sub-Gaussian. 

4 It is very important to note that the partition is fixed 
for this algorithm and that it only adapts the allocation to 
the function. 



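Theorem 1 Under Assumptions 2 and 1, the pseudo-risk of MC-UCB (see footnote 5) launched on partition $\mathcal N$ with parameters $f_{\max}$, $b$ and $\delta$ is bounded, if $n \ge 4K_{\mathcal N}$, with probability $1 - \delta$, as

$$L_n(\mathcal A_{MC-UCB}) \le \frac{\Sigma_{\mathcal N}^2}{n} + C_{\min} \sum_{x \in \mathcal N} \frac{w_x^{2/3}}{n^{4/3}},$$

where $C_{\min} = 4\sqrt{2}\sqrt{A} + 3 f_{\max} A$ and $A = 2\sqrt{2}\,(1 + 3b + 4 f_{\max}) \log(4n^2 (3 f_{\max})^3 / \delta)$.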



The bound in this Theorem is slightly sharper than 
in the original paper. The (improved) proof is in the 
Supplementary Material, see Appendix C.2 

We will use in the sequel the bound in this Theorem as 
a benchmark for the efficiency of any algorithm that 
adapts the partition. The aim will be to construct a 
strategy whose pseudo-regret is almost as small as the 
minimum of this bound over a large class of partitions 
(e.g. the partitions defined by the hierarchical parti- 
tioning). In paper (Carpentier and Munos, 2012a), it 
was proved that this bound is minimax optimal which 
makes it a sensible benchmark. 

The bound in this Theorem depends on two terms. The first, $\frac{\Sigma_{\mathcal N}^2}{n}$, which is the oracle optimal variance of the estimate on partition $\mathcal N$, decreases with the number of strata, and more specifically if the strata are "well-shaped" (i.e. more strata where the variations of $g$ and $s$ are larger). On the other hand, the second term, $\sum_{x \in \mathcal N} \frac{w_x^{2/3}}{n^{4/3}}$, increases when the partition is more refined. There are however two extremal situations for this term, leading to two very different behaviours with the number of strata. If the strata all have the same measure $\frac{1}{K_{\mathcal N}}$, where $K_{\mathcal N}$ is the number of strata in partition $\mathcal N$, then $\sum_{x \in \mathcal N} \frac{w_x^{2/3}}{n^{4/3}} = \frac{K_{\mathcal N}^{1/3}}{n^{4/3}}$. If instead the partition is very localised (i.e. exponential decrease of the measure of the strata), then whatever the number of strata, $\sum_{x \in \mathcal N} \frac{w_x^{2/3}}{n^{4/3}}$ is of order $O\big(\frac{1}{n^{4/3}}\big)$, and the number of strata has no more influence than a constant.

These two facts highlight the importance of adapting the shape of the partition to the function by having potentially strata of heterogeneous measure.
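The two extremal situations can be illustrated numerically by computing $\sum_{x \in \mathcal N} w_x^{2/3}$ (the factor $n^{-4/3}$ is common to both cases and omitted). This is a small sketch with hypothetical helper names; the "localised" partition used here, with measures $1/2, 1/4, \ldots$ and a last stratum completing the total to 1, is one example of exponentially decreasing measures, not a construction taken from the paper.

```python
def refinement_term(weights):
    """Sum of w^(2/3) over the strata (second term without the n^(-4/3) factor)."""
    return sum(w ** (2.0 / 3.0) for w in weights)


def uniform_partition(K):
    """K strata of equal measure 1/K."""
    return [1.0 / K] * K


def localised_partition(K):
    """K strata with measures 1/2, 1/4, ..., 2^-(K-1), 2^-(K-1) (sum is 1)."""
    weights = [2.0 ** -(k + 1) for k in range(K - 1)]
    weights.append(2.0 ** -(K - 1))
    return weights


for K in (8, 64, 512):
    # The uniform term grows like K^(1/3); the localised one stays bounded.
    print(K, refinement_term(uniform_partition(K)),
          refinement_term(localised_partition(K)))
```

For the uniform partition the term equals $K^{1/3}$, while for the localised one it stays below the geometric series limit $2^{-2/3} / (1 - 2^{-2/3}) \approx 1.70$ plus a vanishing last term.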

5 In order to fit with the assumptions of this paper, we 
redefine Va; G Af and Vt < n the upper confidence bound 

defined in the original paper as B x> t = — w x ia x ,t + 
3. Algorithm MC-ULCB
3.1. Additional definitions for algorithm MC-ULCB

Let $\delta > 0$. We first define $A = 2\sqrt{2}\,(1 + 3b + 4 f_{\max}) \log(4n^2 (3 f_{\max})^3 / \delta)$, where $f_{\max}$ and $b$ are chosen such that they satisfy Assumption 1. Set also, for any $h$, $t_h = \lfloor A w_h^{2/3} n^{2/3} \rfloor$.

Let $[h,i]$ be a node of the hierarchical partitioning.

Assume that the children $([h+1, 2i], [h+1, 2i+1])$ of node $[h,i]$ have received at least $t_{h+1}$ samples (and stratum $\mathcal X_{[h,i]}$ has received at least $2t_{h+1}$ samples). The standard deviations $\hat\sigma_{[h+1,j]}$ (for $j \in \{2i, 2i+1\}$) are computed using the first $t_{h+1}$ samples only:

$$\hat\sigma_{[h+1,j]} = \sqrt{\frac{1}{t_{h+1}} \sum_{u=1}^{t_{h+1}} \Big(X_{[h+1,j],u} - \frac{1}{t_{h+1}} \sum_{k=1}^{t_{h+1}} X_{[h+1,j],k}\Big)^2}, \tag{3}$$

where $X_{[h+1,j],u}$ is the $u$-th sample in stratum $\mathcal X_{[h+1,j]}$. We also introduce another estimate for the standard deviation $\sigma_{[h,i]}$, namely $\tilde\sigma_{[h,i]}$, which is computed with the first $2t_{h+1}$ samples in stratum $\mathcal X_{[h,i]}$ (and not with the first $t_h$ samples as $\hat\sigma_{[h,i]}$):

$$\tilde\sigma_{[h,i]} = \sqrt{\frac{1}{2t_{h+1}} \sum_{u=1}^{2t_{h+1}} \Big(X_{[h,i],u} - \frac{1}{2t_{h+1}} \sum_{k=1}^{2t_{h+1}} X_{[h,i],k}\Big)^2}. \tag{4}$$

We use this estimate for technical purposes only.
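A minimal sketch of the threshold $t_h$ and of a standard deviation estimate restricted to the first samples of a stratum is given below. It assumes the $1/t$ normalisation of the empirical variance; the function names and the numerical value of `A` are hypothetical and chosen only so that the example is easy to follow.

```python
import math


def threshold(A, w_h, n):
    """t_h = floor(A * w_h^(2/3) * n^(2/3)) for a stratum of measure w_h."""
    return math.floor(A * w_h ** (2.0 / 3.0) * n ** (2.0 / 3.0))


def empirical_std(samples, t):
    """Standard deviation of the first t samples (assumed 1/t normalisation)."""
    if t <= 0 or len(samples) < t:
        raise ValueError("need at least t >= 1 samples")
    first = samples[:t]
    mean = sum(first) / t
    return math.sqrt(sum((x - mean) ** 2 for x in first) / t)


# Hypothetical constant A; in the text it depends on f_max, b, n and delta.
t_3 = threshold(2.5, 0.125, 1000)               # depth 3, w_3 = 1/8
sigma_hat = empirical_std([1.0, 3.0, 5.0, 100.0], 3)   # ignores the 4th sample
print(t_3, sigma_hat)
```

Only the first `t` samples enter the estimate, which mirrors the fact that the estimates above use the first $t_{h+1}$ (or $2t_{h+1}$) samples of the stratum and not the later ones.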

We now define by induction the value $r$ for any stratum $\mathcal X_{[h,i]}$. We initialise $r$ when there are enough points, i.e. at least $t_0$ points in stratum $\mathcal X_{[0,0]}$, by

r [o,o] = £[o,o] - ^1- Assume that r [hA is defined. 
Whenever there are at least trfo+i] points in strata 
X[h+i,j] for j £ {2i, 2i + l}, we define the value r\h+i,j] 
for j £ {2i, 2i + 1} (and j~ the other) as 



T '[h+l,j] = 



>Wh+lfr[h+l,j] + CA 



2/3 



Whcr[h,i] 



(5) 



, 2/3 



+ 



Wh+l<T[h+l,j] — C 



2/3 



2/3 



+ mm 



min (5-[m-i,j]> + c v^4" 



!""i+l ff [k+l,3'-] — w h+10[h+l,)] 



< 2cVA- 



n l/3 

2/3 
"h+l 

2/3 
V+l 



where c = (8E + l)v^A, S = £[o,o] + i^r- ^ is either a 
(proportional) upper, or a (proportional) lower confi- 



dence bound on wih+x^^ih+ij]- ^ is a (proportional) 
upper confidence bound for the stratum [h + 1, j] that 
has the smallest empirical standard deviation, and a 
(proportional) lower confidence bound for the other. If 
the quantities W[h+ij]d-[ h+ i,2i] and W[ h+ i,j]a[h+i,2i+i] 
are too close, we set the same value to both sub-strata. 
The quantities r^ h ,{\ are key elements in algorithm MC- 
ULCB, and they account for the name of the algorithm 
(Monte Carlo Upper Lower Confidence Bound). 

Additional to that, we define the technical quantities 

H = +l,B = 38V2Ac(l + I) and 

C' max = max(B, UHcy/A) + 2y/J. 

3.2. Sampling Schemes 

The algorithm MC-ULCB that we consider in the next Subsection works by updating a partition of the domain, refining it more where it seems necessary (i.e. where the algorithm detects that $g$ or $s$ have large variations). In order to do that, the algorithm needs to split some nodes $[h,i]$ into their children nodes. We thus need guarantees on the number of samples in each child node $[h+1, 2i]$ and $[h+1, 2i+1]$ when there are $t$ samples in $[h,i]$. More precisely, we would like to have, up to rounding issues, $t/2$ samples in each child node.

The problem is that usual sampling procedures do not guarantee that. In particular, if one chooses the naive idea for sampling stratum $\mathcal X_{[h,i]}$, i.e. collecting $t$ samples independently at random according to $\nu_{\mathcal X_{[h,i]}}$, then there is no guarantee on the exact numbers of samples in $[h+1, 2i]$ and $[h+1, 2i+1]$. However, we would like the sampling scheme that we use to preserve the nice properties of sampling according to $\nu_{\mathcal X_{[h,i]}}$, i.e. that the empirical mean built on the samples remains an unbiased estimate of $\mu_{[h,i]}$ and that it has a variance smaller than or equal to $\sigma^2_{[h,i]}/t$. This is one of the reasons why we need alternative sampling schemes.

The Balanced Sampling Scheme

We first describe what we call the Balanced Sampling Scheme (BSS). We design this sampling scheme in order to be able to divide each stratum at any time, so that at any time the number of points in each sub-stratum is proportional to the measure of the sub-stratum (up to a difference of one sample).

The proposed methodology is the following recursive procedure. Consider a stratum $\mathcal X_{[h,i]}$, indexed by node $[h,i]$, that has already been sampled $t$ times according to the BSS. It has two children in the hierarchical partitioning, namely $[h+1, 2i]$ and $[h+1, 2i+1]$. If they have been sampled a different number of times, e.g. $T_{[h+1,2i]} < T_{[h+1,2i+1]}$, we choose the child that contains the smallest number of points, here $[h+1, 2i]$, and apply BSS to this child. If the number of points in each of these nodes is equal, i.e. $T_{[h+1,2i]} = T_{[h+1,2i+1]}$, we choose uniformly at random one of these two children, and apply BSS to this child. We iterate the procedure in the chosen node until, for some depth $h+l$ and node $j$, one has $T_{[h+l,j]} = 0$. When $T_{[h+l,j]} = 0$, we sample a point at random in stratum $\mathcal X_{[h+l,j]}$, according to $\nu_{\mathcal X_{[h+l,j]}}$. This provides the $(t+1)$th sample.

We provide in Figure 2 the pseudo-code of this recursive procedure.

$X = $ BSS$([p,j])$
if $T_{[p+1,2j]} \ne T_{[p+1,2j+1]}$ then
    return BSS(argmin($T_{[p+1,2j]}$, $T_{[p+1,2j+1]}$))
else if $T_{[p+1,2j]} = T_{[p+1,2j+1]} > 0$ then
    return BSS($[p+1, 2j + B(1/2)]$)
else
    return $X \sim \nu_{\mathcal X_{[p,j]}}$
endif

Figure 2. Recursive BSS procedure. $B(1/2)$ is a sample of the Bernoulli distribution of parameter 1/2 (i.e. we sample at random among the two children strata).

An immediate property is that if stratum $[h,i]$ is sampled $t$ times according to the BSS, any descendant stratum $[p,j]$ of $[h,i]$ is such that $\lfloor t\, w_p / w_h \rfloor \le T_{[p,j]} \le \lceil t\, w_p / w_h \rceil$.

We also provide the following Lemma, which gives properties of the empirical mean when sampling with the BSS.

Lemma 1 Let $\mathcal X_{[h,i]}$ be a stratum where one samples $t$ times according to the BSS. Then the empirical mean $\hat\mu_{[h,i]}$ of the samples is such that

$$\mathbb E[\hat\mu_{[h,i]}] = \mu_{[h,i]} \quad \text{and} \quad \mathbb V[\hat\mu_{[h,i]}] \le \frac{\sigma^2_{[h,i]}}{t}.$$

The proof of this Lemma is in the Supplementary Material (Appendix B). This Lemma also holds for the children nodes of $[h,i]$ (for a descendant $[p,j]$, it holds with $\lfloor t\, w_p / w_h \rfloor$ samples, since the procedure is recursive).
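The recursive procedure can be sketched as follows. This is an illustrative implementation under the assumption that $\mathcal X = [0, 1)$ with dyadic intervals and the Lebesgue measure; the class name `BalancedSampler` is hypothetical, and the number of points in a stratum is recomputed from the stored points, which is simple but not efficient.

```python
import random


class BalancedSampler:
    """Sketch of the BSS on [0, 1) with dyadic strata (assumed setting)."""

    def __init__(self, rng):
        self.rng = rng
        self.points = []

    def count(self, h, i):
        """Number of stored points lying in the stratum of node [h, i]."""
        width = 2.0 ** -h
        lo, hi = i * width, (i + 1) * width
        return sum(1 for x in self.points if lo <= x < hi)

    def sample(self, h=0, i=0):
        """Draw one more point in the stratum of node [h, i] with the BSS."""
        # Descend towards the least populated child until an empty stratum.
        while self.count(h, i) > 0:
            left = self.count(h + 1, 2 * i)
            right = self.count(h + 1, 2 * i + 1)
            if left < right:
                i = 2 * i
            elif right < left:
                i = 2 * i + 1
            else:
                i = 2 * i + self.rng.randint(0, 1)   # tie: fair coin
            h += 1
        # Empty stratum reached: sample uniformly inside it.
        x = (i + self.rng.random()) * 2.0 ** -h
        self.points.append(x)
        return x


sampler = BalancedSampler(random.Random(1))
t = 37
for _ in range(t):
    sampler.sample()
# Every stratum of depth p contains floor(t / 2^p) or ceil(t / 2^p) points.
print([sampler.count(2, j) for j in range(4)])
```

Because each new point always goes to the less populated child (with a fair coin on ties), the counts of two sibling strata never differ by more than one, which is the balance property used above.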

A variant of the BSS: the BSS-A procedure 

We now define a variant of the BSS: the BSS-A sampling scheme.

We also need this variant because it is crucial, if two children of a node have obviously very different variances, to allocate more samples to the node that has the higher variance. Indeed, the number of samples that one allocates to a node is directly linked to the amount of exploration that one can do of this node, and thus to the local refinement of the partitioning that one may consider. But it is also necessary to be careful and to have an allocation that is more efficient than uniform allocation, as it is not certain that splitting the parent node is a good idea. In order to do that, we construct a scheme that uses upper confidence bounds for the child with the smaller variations, and lower confidence bounds for the child with the larger variations: we use the $r_{[h,i]}$ that were defined for this purpose. We assume that these $r_{[h,i]}$ are defined in some sub-tree $\mathcal T^e$ of the hierarchical partitioning, and undefined outside. Using such an allocation is naturally less efficient than the optimal oracle allocation, but still more efficient than uniform allocation. We illustrate this concept in Figure 3 and provide the pseudo-code in Figure 4.



[Figure 3: diagram of the number of samples allocated to stratum 1 and stratum 2, marking the uniform and the optimal number of samples for each stratum. Strategies in the gray zone between the two are less efficient than the optimal allocation, but more efficient than the uniform one.]

Figure 3. With high probability, the children of each node in $\mathcal N_n$ are sampled a number of times that is in the gray zone by MC-ULCB.



$X = $ BSS-A$([p,j], \mathcal T^e)$
if $\{[p+1, 2j], [p+1, 2j+1]\} \in \mathcal T^e$ then
    return BSS-A(argmin($T_{[p+1,2j]} / r_{[p+1,2j]}$, $T_{[p+1,2j+1]} / r_{[p+1,2j+1]}$), $\mathcal T^e$)
else
    return $X = $ BSS$([p,j])$
endif

Figure 4. Recursive BSS-A procedure.

3.3. Algorithm Monte-Carlo Upper-Lower Confidence Bound

We now describe the algorithm Monte-Carlo Upper-Lower Confidence Bound. It is decomposed into two main phases, a first Exploration Phase, and then an Exploitation Phase.

The Exploration Phase uses Upper and Lower Confidence bounds for allocating the samples correctly. During this phase, we update an Exploration partition, that we write $\mathcal N^e_t$, and that is included in the hierarchical partitioning. When, in a stratum $[h,i] \in \mathcal N^e_t$, there are more than $t_h$ samples (and if the standard deviation of the stratum is large enough), we update the partition by setting $\mathcal N^e_{t+1} = \mathcal N^e_t \cup [h+1, 2i] \cup [h+1, 2i+1] \setminus [h,i]$: we divide $[h,i]$ into its two children strata, and compute the $r$ corresponding to the children strata. The
points are then allocated in the strata according to $\frac{r_{[h,i]}}{T_{[h,i],t}}$: a point is allocated in stratum $[h,i] \in \mathcal N^e_t$ if $\frac{r_{[h,i]}}{T_{[h,i],t}} \ge \frac{r_x}{T_{x,t}}$ for every $x \in \mathcal N^e_t$. All the points are allocated inside each stratum $[h,i] \in \mathcal N^e_t$ according to the BSS procedure.
The Exploration Phase stops at time T, when every 
node [h, i] G is such that y- 



r ih,i\ 



n + l 



< 



4S 



We write $\mathcal T^e$ the tree that is composed of all the nodes in $\mathcal N^e_T$ and of their ancestors. The algorithm selects in this tree a partition, that we write $\mathcal N_n$, and that is an empirical minimiser (over all partitions in $\mathcal T^e$) of the upper bound on the regret of algorithm MC-UCB.
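When the criterion to minimise is a sum of per-stratum costs, the minimisation over all partitions of a dyadic tree can be done by a bottom-up recursion: a node is kept as a stratum unless the best partitions of its two children are jointly cheaper. The sketch below illustrates this under stated assumptions: the per-stratum cost is taken to be of the form $w_h \hat\sigma_{[h,i]} + C\, w_h^{2/3} / n^{1/3}$ with $w_h = 2^{-h}$, and the constant `C`, the dictionary `sigma_hat` and the function names are all hypothetical.

```python
def best_partition(node, cost, is_leaf):
    """Cheapest partition of the subtree rooted at node, for an additive cost.

    Returns (total cost, list of nodes forming the partition).
    """
    h, i = node
    own = cost(node)
    if is_leaf(node):
        return own, [node]
    left_cost, left_nodes = best_partition((h + 1, 2 * i), cost, is_leaf)
    right_cost, right_nodes = best_partition((h + 1, 2 * i + 1), cost, is_leaf)
    if left_cost + right_cost < own:
        return left_cost + right_cost, left_nodes + right_nodes
    return own, [node]


# Hypothetical explored tree with empirical standard deviations per node.
sigma_hat = {(0, 0): 1.0, (1, 0): 0.1, (1, 1): 0.2, (2, 0): 0.1, (2, 1): 0.1}
n, C = 1000, 1.0   # C is an assumed constant, not the one of the paper


def cost(node):
    w = 2.0 ** -node[0]
    return w * sigma_hat[node] + C * w ** (2.0 / 3.0) / n ** (1.0 / 3.0)


def is_leaf(node):
    h, i = node
    return (h + 1, 2 * i) not in sigma_hat


total, partition = best_partition((0, 0), cost, is_leaf)
print(partition)  # the root is split, node (1, 0) is not
```

In this made-up example, splitting the root pays off because the children are much more homogeneous, whereas splitting node (1, 0) does not reduce $w \hat\sigma$ enough to compensate for the refinement penalty.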

Finally, we perform the Exploitation Phase, which is very similar to launching algorithm MC-UCB on $\mathcal N_n$. We pull the samples in the strata of $\mathcal N_n$ according to the BSS-A sampling scheme (described in Figure 4). We compute the final estimate $\hat\mu_n$ of $\mu$ as a stratified estimate with respect to the deepest partition of $\mathcal T^e$, i.e. $\mathcal N^e_T$:

$$\hat\mu_n = \sum_{[h,i] \in \mathcal N^e_T} w_h \hat\mu_{[h,i],n}, \tag{6}$$

where $\hat\mu_{[h,i],n}$ is the empirical mean of all the samples in stratum $\mathcal X_{[h,i]}$.

We now provide the pseudo-code of algorithm MC- 
ULCB in Figure 5. 



Input: $f_{\max}$, $b$ and $\delta$.

Initialization: Pull $t_0$ samples by BSS([0,0]). Set $\mathcal N^e = \{[0,0]\}$.

Exploration Phase: 
while 3[h, i] G Mt : > f do 

Take a sample in BSS([/i,i]). 

if 3[h,i] G Nt : \T[hA,t = 2t h+1 ,w h a [hAt > 

6HcVA^,h< #} then 

M+i = A/? \J[h + 1, 2i] \J[h + 1, 2* + 1] \ [h, i] 
Compute r [h+li2i ] and r^+i^i+i] 
end if 
end while 

Select Af n such that J\f n = argminAfeT,f (i-Af + 

(C" max — VA) J2y£M ^T73 J 

T = t 

Exploitation Phase: 
for t = T+ l,...,n do 

Compute B [M] , t = T[fe ^ t _ i (a [hii] + y^T) for 

any [h, i] G N n 

Choose a leaf [h,i]t such that [h,i] t = 

Pick a point according to BSS-A([/i, i]t) 
end for 
Output: fi n 



Figure 5. The pseudo-code of the Tree-MC-UCB algorithm. The empirical standard deviations and means $\hat\sigma_{[h,i]}$, $\hat\mu_n$ and $\tilde\sigma_{[h,i]}$ are computed using Equations 3, 6 and 4. The value of $r_{[h,i]}$ is computed using Equation 5. The BSS algorithm is described in Figure 2 and the BSS-A algorithm is described in Figure 4.



3.4. Main result 

We are now going to provide the main result for the 
pseudo-risk of algorithm MC-ULCB. 

Theorem 2 Under Assumptions 2 and 3 for the strata and Assumption 1 for the function $F$, the pseudo-risk of algorithm MC-ULCB is bounded with probability $1 - \delta$ as



$$L_n(\mathcal A_{MC-ULCB}) \le \sum_{[h,i] \in \mathcal N_n} \frac{(w_h \sigma_{[h,i]})^2}{T_{[h,i],n}}$$



< min 



V 2 

r c- m 



[h,i]£Af 



2/3 
-.4/3 



+ c' n 



2/3 
"h 
,4/3 



where $\min_{\mathcal N}$ means minimum over all partitions of the hierarchical partitioning, and $C'_{\max} \le 32\sqrt{2}\,(1 + 3b + 4 f_{\max}) \log(4n^2 (3 f_{\max})^3 / \delta)\,(1/\sigma_{[0,0]} + 1)(8\sigma_{[0,0]} + 1) \log((3 f_{\max})^3 n)$.

The proof of this result is in the Supplementary Ma- 
terial (Appendix D). 

A first remark on this result is that even the first inequality (i.e. $L_n(\mathcal A_{MC-ULCB}) \le \sum_{[h,i] \in \mathcal N_n} \frac{(w_h \sigma_{[h,i]})^2}{T_{[h,i],n}}$) is not trivial, since the algorithm does not sample at random according to $\nu_{\mathcal X_{[h,i]}}$ in the strata $[h,i] \in \mathcal N_n$, but according to the BSS-A. It was necessary to do that since, in order to select $\mathcal N_n$ wisely, one should have explored the tree $\mathcal T^e$, and thus it was necessary to allocate the points so as to allow splitting of the nodes and adequate exploration.

Assume that minj\/E_A/ is lower bounded, e.g. the 
function F is noisy (i.e. the function s is not al- 
most surely equal to 0). Then a second remark 
is that the second term in the final bound, namely 

/ 2/3 s 2 

Cm ax (E[M]e^^3-J > is negligible when compared 

2/3 

to the second term, namely £ [h i] <=Af ^73" ■ Indeed, 
since c^o] is bounded by Assumption 1 by / max , 

^2/3 



we know that min/v" 



is smaller than 



~ , which implies that for 



one of the partitions M that realises this minimum, 
we nave o max yi^\hfi^ ^ - °™ 



.,i]GjV 



which is negligible when compared to n 4 / 3 and thus 

2/3 

in particular £[ MeA f ^jw- 
3.5. Discussion 

Algorithm MC-ULCB does almost as well as MC-UCB on the best partition: The result in Theorem 2 states that algorithm MC-ULCB adaptively selects a partition that is almost a minimiser of the upper bound on the pseudo-risk of algorithm MC-UCB. It then allocates the samples almost optimally in this partition. Its upper bound on the regret is thus smaller, up to an additional multiplicative term contained in $C'_{\max}$, than the upper bound on the regret of algorithm MC-UCB launched on an optimal partition of the hierarchical partitioning. The issue is that $C'_{\max}$ is bigger than the constant $C_{\min}$ for MC-UCB. More precisely, we have $C'_{\max} = C_{\min} \times C \log((3 f_{\max})^3 n)$, where $C$ is a constant depending on $f_{\max}$ and $b$ (see the bound on $C'_{\max}$ in Theorem 2). This additional dependency in $\log(n)$ is not an artifact of the proof: it appears because we perform some model selection for selecting the partition $\mathcal N_n$. We do not know whether it is possible to get rid of it. Note however that a log factor already appears in the bound of MC-UCB, and that the question of whether it is needed remains open.

The final partition $\mathcal{N}_n$: Algorithm MC-ULCB refines 
the partition $\mathcal{N}_n$ more in parts of the domain 
where splitting a stratum $[h, i]$ into a sub-partition 
$B_{[h,i],\mathcal{N}}$ is such that $w_{[h,i]}\sigma_{[h,i]} - \sum_{x \in B_{[h,i],\mathcal{N}}} w_x \sigma_x$ is 
large. Note that this corresponds, by definition of the 
$\sigma_{[h,i]}$, to parts of the domain where $g$ and $s$ have large 
variations. We do not refine the partition in regions of 
the domain where this is not the case, since it is more 
efficient to also have as few strata as possible. 
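
As a rough illustration of this refinement criterion, the Python sketch below evaluates the gain $w_{[h,i]}\sigma_{[h,i]} - \sum_x w_x\sigma_x$ obtained by splitting an interval into its two halves. It assumes, as the expression in Section C.1 suggests, that the $\sigma$ of a stratum is the standard deviation of $g$ inside the stratum plus the root-mean-square of $s$; the functions `g` and `s`, the grid size and the helper names are hypothetical.

```python
import numpy as np

def weighted_sigma(g_vals, s_vals, weight):
    """Return w * sigma for one stratum discretized on a uniform grid.

    Assumption: sigma = std of g inside the stratum + root-mean-square of s.
    """
    sigma = np.std(g_vals) + np.sqrt(np.mean(s_vals ** 2))
    return weight * sigma

def split_gain(g, s, lo, hi, grid=1024):
    """w*sigma of [lo, hi) minus the sum of w*sigma over its two halves."""
    x = lo + (np.arange(grid) + 0.5) * (hi - lo) / grid  # grid midpoints
    gv, sv = g(x), s(x)
    half = grid // 2
    w = hi - lo
    parent = weighted_sigma(gv, sv, w)
    children = (weighted_sigma(gv[:half], sv[:half], w / 2)
                + weighted_sigma(gv[half:], sv[half:], w / 2))
    return parent - children

# Hypothetical mean function g and noise level s, heterogeneous across x = 0.5.
g = lambda x: np.where(x < 0.5, 0.0, 4.0 * (x - 0.5))
s = lambda x: np.where(x < 0.5, 0.1, 1.0)

gain_heterogeneous = split_gain(g, s, 0.0, 1.0)   # large: worth splitting
gain_homogeneous = split_gain(g, s, 0.0, 0.5)     # about zero: no benefit
```

With this definition the gain is never negative, and it vanishes on a stratum where $g$ and $s$ are constant, which matches the remark that such regions are left unrefined.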

The sampling schemes: The key points in this 
paper are the sampling schemes. Indeed, we construct 
and use a sampling technique, the BSS, such that 
the samples are collected in a way that 
is reminiscent of low-discrepancy sampling schemes (see footnote 6) on the domain, 
and that provides an estimate whose variance 
is smaller than that of crude Monte-Carlo. We also 
build another sampling scheme, BSS-A. This sampling 
scheme ensures that, with high probability, if two children 
strata have very different variances, then the one 
with higher variance is sampled more. At the same 
time, it ensures that if the decision of splitting 
a stratum is finally not taken, then the allocation in the 
stratum is still at least as efficient as random 
allocation according to $\nu$ restricted to the stratum. 
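
The balancing property can be sketched in Python as follows. This is an offline simplification written for illustration, not the paper's BSS: it assumes the domain is $[0, 1)$ with the uniform measure, fixes the number $t$ of samples in advance, and places one or two points in each of the $2^l$ dyadic cells, with $l = \lfloor \log_2 t \rfloor$. The integrand `f` and all function names are hypothetical.

```python
import numpy as np

def balanced_sample(t, rng):
    """Draw t points in [0, 1) with 1 or 2 points in each of 2**l dyadic cells."""
    l = t.bit_length() - 1            # l = floor(log2(t))
    cells = 2 ** l
    # Every cell receives one point; t - cells distinct cells receive a second one.
    extra = rng.choice(cells, size=t - cells, replace=False)
    cell_idx = np.concatenate([np.arange(cells), extra])
    return (cell_idx + rng.random(t)) / cells

def crude_sample(t, rng):
    """Crude Monte-Carlo: t independent uniform points."""
    return rng.random(t)

f = lambda x: np.sin(2 * np.pi * x) + x ** 2   # hypothetical integrand
rng = np.random.default_rng(0)
t, runs = 100, 2000
balanced = np.array([f(balanced_sample(t, rng)).mean() for _ in range(runs)])
crude = np.array([f(crude_sample(t, rng)).mean() for _ in range(runs)])
var_balanced, var_crude = balanced.var(), crude.var()
```

Because the cells that receive a second point are chosen uniformly at random, the plain average stays unbiased, and on this example its empirical variance is smaller than that of crude Monte-Carlo.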

Evaluation of the precision of the estimate and 
confidence intervals: An important question that 
one can ask here concerns the possibility of constructing 
a confidence interval around the estimate that 
we obtain. What we would suggest in this case is 
to upper bound the pseudo-risk of the estimate by 
$\big(\sum_{x \in \mathcal{N}_n} (w_x \hat\sigma_x + w_x^{2/3}/n^{1/3})\big)^2/n$, and construct a confidence 
interval considering this as a bound on the 
6 Although the samples are chosen randomly, the sam- 
pling scheme is such that we know in a deterministic and 
exact way the number of samples in each not too small part 
of the domain. 



variance of the estimate, using e.g. Bennett's inequality. 
If e.g. the noise is symmetric, then the 
pseudo-risk equals the mean squared error, and the 
confidence interval is valid, and in particular asymptotically 
valid (see (Carpentier and Munos, 2011b)). 
It is also less wide (up to a negligible term) than 
the smallest valid confidence interval on the best (oracle) 
stratified estimate on the hierarchical partitioning 
(and then in particular than the one for the crude 
MC estimate). Indeed, the oracle variance of such an 
estimate is $(\inf_{\mathcal{N}} \sum_{x \in \mathcal{N}} w_x \sigma_x)^2/n$, which is by definition 
of $\mathcal{N}_n$ larger than or equal to, up to a negligible term, 
$(\sum_{x \in \mathcal{N}_n} w_x \sigma_x)^2/n$, and this equals, up to a negligible 
term, the upper bound on the pseudo-risk we used 
to construct the confidence interval. 

Conclusion 

In this paper, we presented an algorithm, MC-ULCB, 
that aims at integrating a function in an efficient way. 

MC-ULCB improves the performance of Deep-MC-UCB 
and returns an estimate whose pseudo-risk is 
smaller, up to a constant, than the minimal pseudo-risk 
of MC-UCB run on any partition of the hierarchical 
partitioning. The algorithm adapts the partition to 
the function and the noise on it, i.e. it refines the domain 
more where $g$ and $s$ have large variations. We believe 
that this result is interesting since the class of hierarchical 
partitionings is very rich and can approximate 
many partitions. 

Acknowledgements: The research leading to these 

Supplementary Material for paper: "Toward Optimal Stratification 
for Stratified Monte-Carlo Integration" 

We first introduce the following notation. We write $B_{[h,i],\mathcal{N}}$, where $\mathcal{N}$ is a cut of a dyadic tree, for the sub-partition 
given by the leaves of the tree issued from $[h,i]$ and with leaves in $\mathcal{N}$ (we branch partition $\mathcal{N}$ on leaves $[h,i]$). We 
illustrate this in Figure 6. Similarly and by a slight abuse of notation, we write for any integer $l > 0$ the sub-tree 
$B_{[h,i],l}$ as the sub-tree issued from node $[h,i]$ and extended until depth $h + l$. 

Figure 6. Illustration of $B_{[h,i],\mathcal{N}}$. 

A. Numerical experiments 

We consider the pricing problem of an Asian option introduced in (Glasserman et al., 1999) and later considered 
in (Kawai, 2010; Etore and Jourdain, 2010). This uses a Black-Scholes model with strike $C$ and maturity $T$. Let 
$(W_t)_{0 \le t \le T}$ be a Brownian motion. The discounted payoff of the Asian option is defined as a function of $W$, by: 

$$F((W_t)_t) = e^{-rT} \max\Big[\int_0^T S_0 \exp\big((r - \tfrac{1}{2}s_0^2)t + s_0 W_t\big)\,dt - C,\ 0\Big],$$

where $S_0$, $r$, and $s_0$ are constants. 

We want to estimate the price $p = \mathbb{E}_W[F(W)]$ by Monte-Carlo simulations (by sampling on $W$). In order to 
reduce the variance of the estimated price, we stratify as suggested in (Glasserman et al., 1999; Kawai, 2010) 
the space of $(W_t)_{0 \le t \le T}$ according to the quantiles of $W_T$, i.e. the quantiles of a normal distribution $\mathcal{N}(0, T)$. In 
other words, we re-write $F := F((W_t)_{0 \le t < T}, x)$ where $x \in \mathcal{X} = [0, 1]$ is the quantile that corresponds to $W_T$. 
In this context, the noise $\epsilon$ comes from the directions along which we do not stratify, namely $(W_t)_{0 \le t < T}$. After 
having sampled $W_T$ according to the algorithm for stratified Monte-Carlo (e.g. MC-ULCB), we simulate the rest 
of the Brownian motion $(W_t)_{0 \le t < T}$ by a Brownian bridge (concretely, we discretize this Brownian motion in 
16 values in order to be able to simulate it). We choose the same numerical values as (Kawai, 2010): $S_0 = 100$, 
$r = 0.05$, $s_0 = 0.30$, $T = 1$ and $d = 16$. We choose a strike $C = 90$. 
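
The experimental setup can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the authors' code: the time integral is approximated by a Riemann sum over the $d = 16$ grid points, the Brownian bridge is filled in forward in time, and the allocation is uniform over $K$ equal-measure strata rather than the adaptive allocation of MC-ULCB. The function names `payoff` and `stratified_price` are hypothetical.

```python
import numpy as np
from statistics import NormalDist

S0, r, s0, T, d, C = 100.0, 0.05, 0.30, 1.0, 16, 90.0
TIMES = np.linspace(T / d, T, d)   # discretization grid t_1, ..., t_d = T

def payoff(x, rng):
    """One noisy evaluation F(x, eps), where x in (0, 1) is the quantile of W_T."""
    x = min(max(x, 1e-12), 1.0 - 1e-12)            # keep inv_cdf finite
    w_T = np.sqrt(T) * NormalDist().inv_cdf(x)     # terminal value fixed by x
    w = np.empty(d)
    prev_t, prev_w = 0.0, 0.0
    for k in range(d - 1):                         # Brownian bridge, forward in time
        dt, left = TIMES[k] - prev_t, T - prev_t
        mean = prev_w + dt / left * (w_T - prev_w)
        std = np.sqrt(dt * (T - TIMES[k]) / left)
        w[k] = mean + std * rng.standard_normal()
        prev_t, prev_w = TIMES[k], w[k]
    w[-1] = w_T
    prices = S0 * np.exp((r - 0.5 * s0 ** 2) * TIMES + s0 * w)
    integral = T * prices.mean()                   # Riemann sum of the time integral
    return np.exp(-r * T) * max(integral - C, 0.0)

def stratified_price(n, K, rng):
    """Uniform allocation of n evaluations over K equal-measure strata of [0, 1]."""
    per_stratum = n // K
    estimate = 0.0
    for k in range(K):
        xs = (k + rng.random(per_stratum)) / K     # quantiles inside stratum k
        estimate += np.mean([payoff(x, rng) for x in xs]) / K
    return estimate

price = stratified_price(2000, 10, np.random.default_rng(0))
```

Here `x` plays the role of the stratified variable and the bridge increments play the role of the noise $\epsilon$; an adaptive method would replace the uniform loop over strata by its own allocation rule.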

By studying the range of $F(W)$, we set the (meta-)parameters of the algorithm MC-ULCB to $A = 2\log(n)$ 
and $H = 0.3\log(n)$ (the other parameters adjust automatically with these two meta-parameters). Our main 
competitor is the algorithm described in (Etore et al., 2011), which we refer to as A-SSAA, and which also 
performs adaptive allocation and stratification. 

We first observe the behaviour of MC-ULCB with a budget of $n = 2000$. On a typical run, algorithm MC-ULCB 
divides the domain $[0,1]$ into approximately 15 strata that form partition $\mathcal{N}_n$, and the partition is more refined 
where $s$ and $g$ vary more. We illustrate this in Figure 7. 

In Figure A, we display the performances (averaged over 10000 runs) of algorithms MC-ULCB, A-SSAA, and MC-UCB 
(launched on some partitions in $K$ hypercubes of same measure). Note first that through the performances 
of MC-UCB launched on partitions with varying numbers of strata, we observe that the optimal number of strata 
increases with $n$. We observe that MC-ULCB is more efficient than algorithm MC-UCB launched on any of 
these partitions in $K$ strata. This is not very surprising since we only consider MC-UCB launched on partitions 
where all strata have the same size, i.e. these partitions are not adapted to the function $F$. We would probably 
observe slightly better results for MC-UCB if we launched it on an oracle partition with respect to $F$, but 
such a partition is not easy to build, even when the function $F$ is known. Also, MC-ULCB is more efficient 
than A-SSAA, and that for any sample size. This is not very surprising since the price model for the Asian option 
happens to verify Assumption 1, which is more restrictive than the assumptions made in (Etore et al., 
2011). This assumption is used to tune the algorithm. In (Etore et al., 2011), since they do not make 
this sub-Gaussian assumption, they cannot calibrate the length of the exploration phase with respect to the 
properties of the distribution, and thus cannot fit the exploration/exploitation trade-off to the problem. 

Figure 7. Stratification of the space for a typical run of MC-ULCB ($n = 2000$; horizontal axis: $P(W_d < x)$). 



Budget n            n = 200     n = 2000         n = 20000
Crude MC            5.1         5.1 x 10^-1      5.1 x 10^-2
MC-UCB, K = 5       4.65        4.65 x 10^-1     4.64 x 10^-2
MC-UCB, K = 10      4.56        4.55 x 10^-1     4.55 x 10^-2
MC-UCB, K = 20      4.63        4.49 x 10^-1     4.41 x 10^-2
MC-UCB, K = 40      4.71        4.655 x 10^-1    4.31 x 10^-2
A-SSAA              4.32        4.25 x 10^-1     4.13 x 10^-2
MC-ULCB             4.08        3.95 x 10^-1     3.82 x 10^-2

Table 1. Mean squared errors of the estimates output by the strategies for different values of n. 
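
To read Table 1 more easily, the short Python snippet below recomputes, from the reported values only, the mean squared error of each strategy relative to crude Monte-Carlo and the best strategy for each budget. The dictionary layout and variable names are hypothetical; no number beyond those in the table is used.

```python
# Mean squared errors copied from Table 1, for n = 200, 2000, 20000.
budgets = (200, 2000, 20000)
mse = {
    "Crude MC":       (5.1,  5.1e-1,   5.1e-2),
    "MC-UCB, K = 5":  (4.65, 4.65e-1,  4.64e-2),
    "MC-UCB, K = 10": (4.56, 4.55e-1,  4.55e-2),
    "MC-UCB, K = 20": (4.63, 4.49e-1,  4.41e-2),
    "MC-UCB, K = 40": (4.71, 4.655e-1, 4.31e-2),
    "A-SSAA":         (4.32, 4.25e-1,  4.13e-2),
    "MC-ULCB":        (4.08, 3.95e-1,  3.82e-2),
}

# Ratio of each MSE to the crude Monte-Carlo MSE at the same budget.
ratio = {
    name: tuple(v / c for v, c in zip(vals, mse["Crude MC"]))
    for name, vals in mse.items()
}

# Strategy with the smallest MSE for each budget.
best = [min(mse, key=lambda name: mse[name][j]) for j in range(len(budgets))]

# Best fixed number of strata for MC-UCB at each budget.
ucb = [name for name in mse if name.startswith("MC-UCB")]
best_ucb = [min(ucb, key=lambda name: mse[name][j]) for j in range(len(budgets))]
```

On these numbers the best fixed $K$ for MC-UCB moves from 10 to 20 to 40 as $n$ grows, which is the trend described in the text, and MC-ULCB has the smallest error in every column.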



B. Proof of Lemma 1 

Assume that stratum Xt h ^ has been sampled t times according to the BSS. Let (A , . . . ,Ai) € {0, 1}' be the 
(uniquely defined) decomposition in basis 2 of t, i.e. Y^ p =o A P 2 r ~ * ano - ^ = This implies by Assumption 3 
and by definition of (A r ) r , that Ylp=o ^p^T = *• We denote by T>i = (X\, . . . , X t ) the set of the t samples in 
stratum X\h,i] ■ 

By construction of the BSS, there are at most two and at least one element of T>i in each stratum of Bi/j^ij. For 
all j < 2 h+l — 1, we write X\ j the first sample in stratum [h + l,j]. Conditionally to the number t of samples, 
each of these samples is pulled randomly in stratum [h + I, j] according to ^x [h+l ^ ■ 

Let us now consider the largest p < I such that A p — 1. Let us consider T> p = T>i\ {(Xij)^^ ^^^ ( }. By 
construction of the BSS, conditionally to the knowledge that there is a re- numeration of the samples such that 
V0 < j < 2 , Xij ~ VX\ h +i j] ( an d thus conditionally only to the number t of samples since the fact that there 
is a re-numeration such that V0 < j < 2 l ,Xij ~ V[h+i,j] follows deterministically from the budget t), there 
are at most two and at least one element of T> p in each stratum of p . We note X p j the first sample. By 
construction of the BSS and conditionally to the number t of samples, each of these samples is pulled randomly 
in stratum [h + p, j] according to ^x< h+ 3 -i • 

We can continue this induction for every $p$ such that $\lambda_p = 1$. We have, at the end of the induction, relabeled 
(through the relabeling that we presented) every sample (in $\mathcal{D}_l$) by $X_{p,j}$. We know that conditional to the number 
$t$ of samples, $\forall p$ such that $\lambda_p = 1$, and $\forall\, 0 \le j \le 2^{h+p} - 1$, $X_{p,j} \sim \nu_{X_{[h+p,j]}}$, and also that these relabeled samples are all 
independent of each other (although the relabeling of each sample is random and is not independent of the other 
samples). 
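
The base-2 decomposition of $t$ that drives this induction can be made concrete with a few lines of Python. The reading that each non-zero binary digit of $t$ corresponds to one batch of $2^p$ relabeled samples is an interpretation of the proof, whose formulas are only partly legible in this copy; the function name is hypothetical.

```python
def binary_levels(t):
    """Exponents p whose binary digit in t equals 1, so that t = sum of 2**p."""
    return [p for p in range(t.bit_length()) if (t >> p) & 1]

t = 13                                    # 13 = 1 + 4 + 8
levels = binary_levels(t)                 # exponents with a non-zero digit
batch_sizes = [2 ** p for p in levels]    # one batch of 2**p samples per exponent
```

For $t = 13$ the non-zero digits sit at exponents 0, 2 and 3, so the samples split into batches of sizes 1, 4 and 8.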

The empirical mean ft[h,i] on stratum [h, i] thus satisfies 

l 



s=l 



!*Y .A 

— W v t Wh 

p=o p [ h+p ,j]eB [hiihP 



Since by construction X)p=o = ^ the empirical estimate of the mean thus satisfies 



E M = E S 



P =o p [h+p,j]eB lhtihp 
Note now that the variance of this estimate is such that 

,,2 



sr- w p a Wh a 

1^ —Vlh+p,j]Ap = ^ —,V[hA A P 



v^m] -E3k E e)H+ P ,A < E ^-Ak^v < 



w'lt 2 * — ' 'Wh 
P=0 P [h+p,j]eB [Kihp 



p=0 



C. Preliminary results 

C.l. An interesting large probability event 

Lemma 2 For a stratum X[h.i] of the hierarchical partition, write (^X^ h ^ . . . , X^ h i ^ the samples collected 
by BSS in stratum X\h t i\ (or by BSS in a stratum of smaller depth). Consider the event 



2l>g(*)J -1 



2l>g(t)J 



£ : n n \ k 2 u°g(t)j e { x [h,i]>« 2 Lio g (*)j e 

[h,i]:h<H t=2 \ \ a=0 a'=0 



X 



[h,i] ,a' 



[h 

where A = 2^2{l + 36 + 4/ max ) log(4n 2 (3/ max ) 3 /<$) and H = [ 
Note also that for h > H,\/i < 2 h — 1, we have 



°~[h; 



(J) 



log ((3/„ 



log(2) 



-J +1. ThenV{(,) >l-6. 



W[h,i]0-[hA] < 



2/3 
\h,i) 
,1/3 



Proof Probability of the event £ 



Let $[h, i]$ be a stratum of the hierarchical partitioning such that $h \le H$ and $t \ge 2$. Let $l = \lfloor \log(t) \rfloor$. By definition 
of the BSS, we know that for $s \le 2^l$, sample $X_{[h,i],s}$, conditionally to the $s - 1$ other samples, is sampled uniformly 
inside the strata $X_{[h+l,\cdot]}$ that contain no samples, and independent of the other samples. 

Using the results from Lemma 13, we know that with probability $1 - \delta$, the estimate of the standard deviation 
computed with the $2^l$ first samples satisfies 



2>-l 



2>-l 



T[h,i] 



b=0 



< 2 



< 2 



< 2 



(1 + 36 + 4V - ) log(2/J) 
2 l 

2(1 + 36 + 41/) log(2/<5) 



2(l + 36 + 4/ max )log(2/<5) 






By the definition of $H$, we know that there are fewer than $2 \times 2^H$ strata in the hierarchical partitioning of depth 
smaller than $H$. Because of the definition of $A$, we have $\mathbb{P}(\xi) \ge 1 - \delta$. 

Characterisation of the strata of depth bigger than H 

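Consider a node $[h,i]$ of depth $h \ge H$. As both $g$ and $s$ are bounded by $f_{\max}$ (see Assumption 1), then 

$$\begin{aligned}
w_{[h,i]}\sigma_{[h,i]} &= \sqrt{w_{[h,i]} \int_{X_{[h,i]}} s^2(x)\,dx} + \sqrt{w_{[h,i]} \int_{X_{[h,i]}} (g(x) - \mu_{[h,i]})^2\,dx} \\
&\le \sqrt{w_{[h,i]} \int_{X_{[h,i]}} f_{\max}^2\,dx} + \sqrt{w_{[h,i]} \int_{X_{[h,i]}} 4 f_{\max}^2\,dx} \\
&\le 3 w_{[h,i]} f_{\max}.
\end{aligned}$$

As $h \ge H$, we have $w_{[h,i]} \le (\tfrac{1}{2})^H \le \frac{1}{(3 f_{\max})^3 n}$. From that we deduce that for $h \ge H$, 

$$w_{[h,i]}\sigma_{[h,i]} \le \frac{w_{[h,i]}^{2/3}}{n^{1/3}}.$$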
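
A quick numerical check of this characterisation, written in Python, is given below. It assumes a dyadic tree in which a stratum at depth $h$ has measure $2^{-h}$, reads the depth threshold as $H = \lfloor \log((3 f_{\max})^3 n)/\log(2) \rfloor + 1$ from the statement of Lemma 2, and uses hypothetical values of $f_{\max}$ and $n$.

```python
import math

def depth_threshold(f_max, n):
    """H = floor(log2((3 f_max)^3 n)) + 1, as read from the statement of Lemma 2."""
    return math.floor(math.log2((3.0 * f_max) ** 3 * n)) + 1

f_max, n = 1.0, 2000            # hypothetical values
H = depth_threshold(f_max, n)
w = 2.0 ** (-H)                 # measure of a stratum at depth H (dyadic tree)
lhs = 3.0 * w * f_max           # upper bound on w * sigma derived above
rhs = w ** (2.0 / 3.0) / n ** (1.0 / 3.0)
```

With these values $H = 16$, the weight $2^{-16}$ is below $1/((3 f_{\max})^3 n)$, and `lhs` is indeed smaller than `rhs`.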

C.2. Rate for the algorithm MC-UCB 

We first prove the following result. 

Proposition 1 Let Assumption 2, 3, and 1 hold. Assume that n > 2BJ2 q £j^ w q ^ 3 n 2 / 3 (with B = 

- ~g~; ~ — -for any < 8 < 1, the algorithm MC-UCB on a partition M n satisfies on £ 7 and thus with 
probability at least 1 — 5, 



T Pi n n v ' rt 4 / J n rr' 6 



where C m m = (4\/2A+ T,j^ n A) and 



Tp. n > A PiSjVii (n-B( ^2 w l 



/3\ n 2/3 



UV2Z+E„ n A) 

where B — - 



Proof Step 1. Properties of the algorithm. For a node q € Aft+i, we first recall the definition of -Bg,t+i 
used in the MC-UCB algorithm 

W„ /„ rr 1 \ 



1 q,t \ w q n L > J 



Using the definition of £ and the fact that if node q is in A/j+i, then T qtt+ i > lAw^ 3 n 2 ^ 3 J , it follows that, on £ 

^<B„ +1 <^L + V3 1 ). (8) 



Let $t + 1 \ge 2K + 1$ be the time at which an arm $q$ is pulled for the last time, that is $T_{q,t} = T_{q,n} - 1$. Note that 
there is at least one arm such that this happens as $n \ge 4K$. Since at $t + 1$ arm $q$ is chosen, then for any other 
arm $p$, we have 

$$B_{p,t+1} \le B_{q,t+1}. \tag{9}$$
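
The selection rule used in this step (pull the arm with the largest index $B$) can be sketched in Python as follows. The exact confidence term of the index is not legible in this copy, so the sketch assumes the generic form $B_k = \frac{w_k}{T_k}\big(\hat\sigma_k + \frac{\text{bonus}}{\sqrt{T_k}}\big)$ with two initial pulls per stratum; the `bonus` parameter, the `sample` callback and the function name are hypothetical.

```python
import numpy as np

def mc_ucb(sample, weights, n, bonus, rng):
    """Allocate n noisy evaluations over K strata with a UCB on each std.

    sample(k, rng) returns one noisy evaluation drawn in stratum k.
    Assumed index: B_k = w_k / T_k * (sigma_hat_k + bonus / sqrt(T_k)).
    """
    K = len(weights)
    obs = [[sample(k, rng), sample(k, rng)] for k in range(K)]  # 2 pulls each
    for _ in range(n - 2 * K):
        index = [weights[k] / len(obs[k])
                 * (np.std(obs[k], ddof=1) + bonus / np.sqrt(len(obs[k])))
                 for k in range(K)]
        k = int(np.argmax(index))          # pull the stratum with the largest index
        obs[k].append(sample(k, rng))
    counts = np.array([len(o) for o in obs])
    estimate = sum(w * np.mean(o) for w, o in zip(weights, obs))
    return estimate, counts

# Two strata of equal measure; the second one is ten times noisier.
noise = [0.1, 1.0]
sample = lambda k, rng: 1.0 + noise[k] * rng.standard_normal()
estimate, counts = mc_ucb(sample, [0.5, 0.5], 500, 1.0, np.random.default_rng(0))
```

On this example most of the budget goes to the noisier stratum, in line with the allocation proportional to $w_q\sigma_q$ that the bound targets.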
From Equation 8 and $T_{q,t} = T_{q,n} - 1$, and also since by construction of the algorithm $T_{q,n} \ge 2$, we obtain on $\xi$ 

(10) 



± q,t \ w q n L ' J / 



Furthermore, since $T_{p,t} \le T_{p,n}$, then on $\xi$ 

$$B_{p,t+1} \ge \frac{w_p \sigma_p}{T_{p,t}} \ge \frac{w_p \sigma_p}{T_{p,n}}. \tag{11}$$

Combining Equations 9-11, we obtain on $\xi$ 



|^(T g> „ - 1) < w q L q + 2^— i— ] 
p,n y w q n L i 6 J 



w p a. 



Summing over all q such that the previous Equation is satisfied, i.e. such that T q _ n > [w q ^ 3 n 2 ^ 3 J , on both sides, 
we obtain on £ 

£ (T,, n -1)< X! ^ 9 L + 2V2A- ' ^ 



P '™ 9 |T 3 ,„>LA^ 3 „V3j 9 |T g ,„> L ^ /3 nV3j V W <3 " , 

This implies 



P,n 3 q=1 \ w q ' n 1 / 6 / 



w p a p 



(12) 



Step 2. Lower bound. Equation 12 implies 



on £, since T Q;n — 1 > (as T g . n > 2). Finally, if n > 2A kj^ n 2 ' 3 , we obtain on £ the following bound 

^<^ + (4^ + ^)^^. (13) 

Step 2bis. Lower bound on the number of pulls. By using Equation 13 and the fact that > 1 — x one 
gets 



/3^2/3 



. „ (4v / 2A+S A r„A) 

where is = ^ -. 

This concludes the proof. 

D. Proof of Theorem 2 
D.l. Some preliminary bounds 

Let c = (8S + l)VA. Note that c > 1. 

Let $[h, i]$ be a stratum that is explored during the Exploration Phase, and split into its two children. 
This implies that $w_h \hat\sigma_{[h,i]} \ge 6Hc\sqrt{A}\, w_h^{2/3}/n^{1/3}$. By definition, for $j \in \{2i, 2i + 1\}$ 

2/3 

( w h+1 a [h+l ^+c^A-^ ^ I a 

?>+lj] = ( : JlM 1 i w ^+l (T [?i+lj-] - u 'h+l CT [h+l,j] > 2cVA — 



/'/Ml w /ixi iy /i+l,j~ /i+1.7 — — i/o 

( 2/3 

I w g r ,, Jim 1 1 ^ft+i^+Li-] - < ~ 2cVA ^i7 



2/3 



w^+i min (o"[/i+i,j],g[/t+i,j-]) + cVA^w 1 
w/^^i] ' 2 

( 2/3 ^ 

/~ w h+l \ 

x IN |io h+ io-[ fc+ ij-] - w/i+io-^+ijjl < 2cVA— r jj \ 



2/3 

1\ 



where $\bar{j}$ is the complement of $j$ in $\{2i, 2i + 1\}$. Note that the three indicators used in the definition of $r$ 
form a partition of the domain. 

Lemma 3 If on $\xi$ a node $[h,i]$ has two children $[h + 1, 2i]$ and $[h + 1, 2i + 1]$ that have been explored by the 
algorithm, then $r_{[h+1,2i]} + r_{[h+1,2i+1]} \le r_{[h,i]}$. 

Proof Note first that Wh+iCf^+i .j-} + w h+i&[h+i.j] < w h&[h.i] (by definition of a and ct, and also because of the 
properties of the empirical variance) . 



The result follows from the definition of r as for j e {2i,2i + 1}, 

2/3 2/3 
\ U>h°-[h,i] J \ *»h&[h,(\,t 



< 1. 



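Lemma 4 For any stratum $X_{[h,i]}$ of depth smaller than $H$, if $r_{[h,i]}$ is defined then on $\xi$ 

$$\frac{2H - h}{2H}\Big(w_{[h,i]}\hat\sigma_{[h,i]} - c\sqrt{A}\,\frac{w_{[h,i]}^{2/3}}{n^{1/3}}\Big) \le r_{[h,i]} \le \frac{H + 2h}{H}\Big(w_{[h,i]}\hat\sigma_{[h,i]} + c\sqrt{A}\,\frac{w_{[h,i]}^{2/3}}{n^{1/3}}\Big).$$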



Proof The proof is done by induction. Note first that r[ j = u^oj^^o] + c-\fA J"'° 3 ] . The result is thus satisfied 
for node [0,0]. 

Assume that the property of Lemma 4 is satisfied for a given [h, i] on £. 

Assume that the children of this node are opened. This implies that Wh<J[h,i\ > QHcy/A — ^ , i.e. 

2/3 

— >3c— (14) 
2i? WhO-[h,i] 

Let j € {2i,2i + 1}. Note first that + w^+iCTf/i+i i3 i < (by definition of <r and a, and also 

y, 2 / 3 

because of the properties of the empirical variance), and that on £, |w? l o'[/i,i] — Wh<?[h,i]\ < 2\J~A as a node 
is open only if there are enough samples in it, i.e. if there are more than n 2 / 3 J samples. This together 

with Equation 14 implies that 

2/3 2/3 

W[h,i]Q-[hj] - cVA-^- ^ W[ hA a[ hAt t - ScVA-^0- ^ ^ 

^h.iptV] 2iJ ' 

as c > 1. In the same way 

2/3 

w [hd]V[h,i] 2iJ 
. . ...2/3 



By Equation 15 

, ^ ) f[ M > (^1^+1.1 " cVA ^ j (-^) (1 - _ 

> [w h+ x<T [h + hj ] - c ^ A ^rj3 ) ( 2H >' ( ^ 



In the same way, by Equation 16 



WhV[h, 



Jr [M < + c^^J (— ^— )(1 + — ) 

(^ +1 a [ft+1>j] + cVA-^) (1 + — + — + 



2/3 



< (18) 



as h < H. 



Assume that l^+id^+ij] - u^+io-^+i j-j | < 2c-/A^f^. Then " +1 ^fg^-j " 1/3 < \- It implies that, by 
Equation 17 



2/3 

> (w fe+ i<7 [/l+lj1 - cVA-^ J ( — ). (19) 



^2/3 

, w 2 ^ w h+\G[h+X j1~t~ c V^ 1/3 1 1 

Assume that I^+kt^+ij] - w/j+id^+ij-]] > -2cVA-^±f. Then - " > 2 - :t implies that, by 

by Equation 18 

2/3 

[fly 



2 " V w h cr[h,{\ ' 

2/3 



2/3 

From Equations 17 and 19, from the definition of r, and from the fact that ( w h a [h ~ — ) r [ h A — 



/u)h+io-[h + i,j]+c-/A ^+ 3 1 \ 



we deduce that 

rih+ij] > [w h+1 a [h+1J] -cVA-- [ ^-J( — ), 

and finish the induction for the left-hand-side on £. 

In the same way, by combining Equations 18 and 20, we finish the induction for the right-hand-side on £. 
Corollary 1 For any stratum $X_{[h,i]}$, if $r_{[h,i]}$ is defined then on $\xi$ 

(2g - h) ( fA W l&\* ^ (H + 2h) { r-w { l% 

Proof This is straightforward from Lemma 4, by the definition of £ and as c > 1. 
Lemma 5 for any stratum X^ h i ^ , if rr/^,-] is defined then on £ 

»>, 

Proof Let [/i, i] be a node. 

2/3 

Assume that the children of this node are explored at time t. This implies that w^a^] > 6H c^/A^j^ , and then 
by Lemma 4, on £, (as 2 ^ fc > |). 

2/3 



2 V nV3 n i/3 

K 2 / 3 
5 i—rW h 

as H > 2. This implies as c > 8£-\/A that 

nM(^)>^/ +V /3. (2i) 



By Equation 15 (as ^H=h > i) 



2iJ 

2/3 



i AT^h + i , 2 /3 



, 2/3 



This implies as c > 8£ V A that 

2/3 



( Wh+1 ° [h+1 *l + ^^ )r lh J£) > (22) 
Let j* = arg minj rr/j+iji . For j = {2i,2i + 1}, we know that from the definition of r, f[h+i,j] > 

2/3 



'[/i,i]> 2 



From that and Equations 21 and 22 we deduce the Lemma. 



D.2. Study of the Exploration Phase 

Lemma 6 On £, the Exploration phase ends at T < n and all the nodes x of partition Af% are such that 
< ^ and > & 

Proof Let T be the time at which the exploration phase ends (if it does not end, write T = n). 
One needs to pull a node in Af% at a time If < T if and only if 

rv 4£ 

> 



T x +> + 1 n 
We thus know that the last time stratum $X_x$ is sampled during the Exploration Phase (and thus at the end of 
the Exploration Phase) 

r x 4S 
T X)T n 

If stratum X x is not sampled during the Exploration Phase after having been opened, then 

T x>T = [Aw 2 J 3 n 2 / 3 \. 
Note that by Lemma 5, on £ r x^=, > Aw x ^ 3 n 2 ^ 3 . From that we deduce that 

r x 4S 

— — > — , 
T X<T n ' 

and from that together with the fact that we only sample a node at time t < T if > — , we deduce the 



second part of the Lemma, i.e. that on £, Vx € Af%, ip^- > — 



•x ^ 4S 

ni T a 

Note now that X^ga^ Tx — r [o.o] = ^ : it i s straightforward by Lemma 3. This directly leads to 



r a , T . 

This directly implies that X)xe./V c < f < n, which leads to the desired result, i.e. that the Exploration 
Phase ends before all the budget has been used. This implies that on £, Va; € Af£, T r * +1 < 

Lemma 7 Let $x$ be a node such that $w_x \sigma_x \ge 14Hc\sqrt{A}\, w_x^{2/3}/n^{1/3}$ and also such that, for all its parents $y$, $w_y \sigma_y \ge 
14Hc\sqrt{A}\, w_y^{2/3}/n^{1/3}$. Then on $\xi$, at the end $T$ of the Exploration phase, node $x$ is open, i.e. $x \in \mathcal{T}_n^e$, which also implies 
$T_{x,T} \ge A w_x^{2/3} n^{2/3}$ ($\ge 2$). 

Proof The result is proven by induction. Assume that there is a node x that satisfies the Assumptions of 

Lemma 7. Then u>[o,o]°~[o,o] > UHcVA -^0 . Note first that after the Initialization, i.e. at the time t = L^4« 2 ^ 3 J 
when T[o.o],t = L^^ 2 ^ 3 ] > i- e - when the decision of opening or not the node is made, we have on £ that 



- w [o,o] 



w [o,o] 



2/3 

.) i [0,0 

w [0,0]& [0,0] > w [o,o] cr [o,o] - z vA—^ 

2/3 

> UHcVA 



n l/3 
2/3 

> 6iWA- 



- w [o,o] 



„l/3 ■ 

The node [0, 0] is thus opened on £ . 

Assume now that an ancestor [h, i] of node x is open. By Lemma 1, we now that on £ 

(2H-h)' _w 2/3 



2\ n 1 / 3 nV3 

2/3 

n l/3 
By Lemma 7, we know that at the end T of the Exploration Phase, with T < n on £, we have r ri ^ +1 < ^p- 
As c > 8Ev^4, we have by using the previous result that T[/i,i],x > Awj^rt 2 / 3 . By the definition of A and 
the fact that h < H , we know also that Aw^^n 2 ^ 3 > 2, which implies that T^^^ > 2. This, together with the 
fact that w^^a^.ii.T > l^-ff Aw/j^n 2 / 3 on £, implies that node [h,i] is open and split into its two children. 
We have thus proved the result of the Lemma by induction. 

Lemma 8 Let T be the end of the Exploration Phase, and let x £ T£ ■ Then on 

r 2/3 2/3 

„ /ow x a x n /—rW x n '°\ 

T xT < max — 15cVA^^= . 

* V 6E E ' 

Proof Let T be the end of the exploration phase. 

Let i € T„ e . Let N be the subset of the partition J\f% that covers x. Let y € A/". By Lemma 6 we have on £ 

> 4— , 



T V-T 

which leads directly to 



4E 

Note that by Lemma 3 one has X^yeW r v — r x- One thus has 

T vT < > -^=- < -^r-. (23) 

Note now that by Corollary 1, we have on £ r x < S^u^cr^ + 2c\J~A^-p^ . From that and Equation 23, we deduce 
that on £ 



^2/3 



1/3 ; 4S 



E 2/3 9/3 

/bw x a x n r~rW x n'° 

< max = — , locv A = 

V 6E E 



This concludes the proof. 



D.3. Characterization of the E/v"„ 

The algorithm selects a partition Af n such that 



Af n e arg^mrn (E^ + 



2/3 

EWy 
n l/3 



with = max(B, 14ifcV^4) + 2v^4 and 5 = 16v / 2Ac(l + i). 

Note that for every partition A/" G 7^f, as all the nodes of 7^ are such that T Xt7l > Aw 2 / 3 ?"! 2 / 3 > 2 by the structure 
of the algorithm. One thus has on £, for any Af partition included in 7^ e , that 

i^-e^i<vz£^! 

because by construction every node of 7^f has depth smaller than H. 






We thus have for the selected partition Af n that, on £, 

2/3 



^ n + (C max -2VA) ]T ^ < 

ye A/",, 



mm 

A/"eTf 



+ c max y^ — j 



2/3 



yeAf 



/3 



(24) 



„2/3 



Let S be the set of all nodes x such that all their ancestors y are such that w y o~ y > 14ffcv A n ^ /3 . This implies 
because <r y is positive, and because C max > 14HcV A that 



mm 

J\fes 



£W + Cmax 5^ f 



2/3 



yeA^ 



/3 



= mm 

M 



SaA + Cmax f 



2/3 



/3 



where min/v is the minimum over all the partitions in the entire hierarchical partitioning. 
Lemma 7 states that on £, S C 7^f. This implies that 



mm 



^aa + C max ^ ^ 
yeA" 



, 2 / 3 



rW3 



< min 

Afes 



+ c max y^ i 



2/3 



/3 



By combining Equations 24, 25 and 26, we obtain on £ 

2/3 



yeAC 



sa/- + c max y^ — | 



2/3 



veAf 



/3 



since C max - 2V A > B. 



D.4. Study of the Exploitation phase 

Lemma 9 At the end of the Exploitation phase (end of the algorithm) one has \/x € M n 



w x a x 

Ty, . 77, TL 



B V ^ 



2/3 



yeAA„ 



/3 



w/iere 5 = Vo\fZAcQ + I). 



(25) 



(26) 



(27) 



Proof Step 1. Lower Bound in each node Let us first note that by Lemma 6, we know that on £, at the 
end T < n of the Exploration Phase, we have Y^xeAf e Tx,t < f ■ There is still a budget of at least pulls left 

for the Exploitation phase. Note first that as a node x is opened only when there are L^L«^ 3 n 2 / 3 J points in it, 

SO Vx € M n ,T x .T > f U^V/ 3 . 

Step 2. Properties of the algorithm. We first recall the definition of B q , t +\ used in the MC-UCB algorithm 
for a node q G N n 

1 

'1 ' 1 ' 



5, 



9 ,t+l 



!/3 1/3 

w q n 1 -' 



Using the definition of £ together with the fact that, by construction, at a time £ of the Exploration Phase, 
T q ,t > [Awq /3 n 2 / 3 \, it follows that, on £ 



T, 



q,i 



T, 



q,i 



^ + 2 ^T73-I73 



(28) 

Let i + l>T + lbe the time at which an arm q is pulled for the last time, that is T qt — T qn — 1. Note that 
there is at least one arm such that this happens as n > T by Lemma 6. Since at t + 1 arm q is chosen, then for 
any other arm p, we have 



B 



P ,t+i 



< B, 



q,t+l 



(29) 
From Equation 28 and $T_{q,t} = T_{q,n} - 1$, we obtain on $\xi$ 



B, 



L 9,* — J -<1, , < 
W„ 



q,t+\ 



< 



T, 



. \ V3 1/3 



T — 1 



. w a n L i A 



(30)

Furthermore, since $T_{p,t} \le T_{p,n}$, then on $\xi$ 

$$B_{p,t+1} \ge \frac{w_p \sigma_p}{T_{p,t}} \ge \frac{w_p \sigma_p}{T_{p,n}}. \tag{31}$$



Combining Equations 29-31, we obtain on £ that if at least one sample is collected from stratum q after the 
Exploration Phase, then 

w n a„ 



^ (T q ,n -l)<w q (a q + 2V2A^— 
1 P,rt y w q n 1 ' 6 1 



(32) 



Step 3: The Exploration Phase has not deteriorated the performance of the algorithm. 

If $T_{y,n} > T_{y,T}$, then samples are pulled from $y$ after the Exploration Phase. By summing Equation 32 over these nodes, 
we obtain that, on $\xi$, for any $x$, 



y^- J2 ( T ^« ~ - J2 w v[ <J 

X,n v\T y , n >T ViT v\Ty, n >T y , T V 



1 



•i t 2\2A — 



< E~ 



< E~ 



2V2AJ2 



2/3 



y|T H> „>T B-T 



,1/3 



2 V / 2A£ 



,1/3 



2/3 



(33) 



where E = 53j/|t „>t t w y a y The passage from line 2 to line 3 come from the fact that T y>n > T y ^ > ^^tjs- 
Lemma 8 states that on £, for all x 6 Af n C T„ 

,o 2 / 3 2/3 



Note also that by Step 1, on £, ^ < X^|t „>t t T y , n - We thus have from these two results that on £, for any 

xeJV n , 



W X (T X 

T 

x x,n 



E ( T v,n - !) > max 



y\Ty,„>T VtT 



' 4 



T 

- 1 x,n 



max 



3, ^„ /^V^ 3 

-A xJ vn— > 15cV A _ 



Ea/-„ 4Ea/„ 



(34) 



By combining Equations 33 and Equation 34, we obtain for every x € Af n that on £ 



T 



< 



riT? h n- — — 



2 /3 2/3 



3n 
' 4 



. 2 /3 



,1/3 



< 



EM, , 872AE 



,4/3 



2/3 



2/3 

30VcVZ^ 
71'/ ^ n 4 / 3 E 

2/3 



EA/^j 38V2AcE^„< ri , 1. 
" n + ~ n 4 /3 1 £ j ' 
where we use the fact that — h n ^^— - > f and < 1 + x for x < 1 for passing from line 1 to line 2. 
We finally have 

^<^ + Bj2%, (35) 

where S = 38V2Ac(l + A). 

Step 4. Lower bound on the number of pulls. By using Equation 35 and the fact that > 1 — x one 
gets 



B 

Lemma 10 Let x € Af n . Let y be an open grand-child of x, and y\ and yi be its two children. Then 



< 



yi ~ ' y-2 



T ~ T 



where i € {1, 2}. 



Proof We consider x € J\f n such that w x a x > &Hc\f~A^fr S : otherwise it has no grand- children. 



By Lemma 8, we know that for any y grand-child of x, we have < Awy^ 3 n 2 ^ 3 . Note that at the moment of a 
node's opening, the number of points in the node is smaller than Aw^^n 2 ! 3 '. As the Exploration stops sampling 
in a stratum x when „ Ty , , < 4 — , we know that at the end T of the Exploration Phase, we have t^— > 4—. 



We prove by induction that Tp 1 - < 4— for any grand-child of x, and that for its two children y\ and V2, we have 
By Lemma 4, we know that as w x a x > QHcy A , we have on £ 



1/3 

2/3 



r x < 3(w x a x + cVA^yJ^ < 3(~w x a x ^ < ^w x a t 



By combining this result with Lemma 9 and also with the definition of Ejv„ > we have on £ 



„ 7w x a x 7/Ea^ W 3 \ 7/ W[o,o]Q-[o,o] C^ ax \ 7 X J 

2T T „ ~ 2 V n ^ n 4 /3 J - 2 V n n 4 /" 



4 /3 y - 2 n ' 



because by definition, + B Y. y eM n < °"[o,o] + , and also because E < a- [0 ,o] + ■ 

Let xi and x 2 be the two children of x. Note first that at the end T of the Exploration Phase, by Lemma 6, 
we have > 4^, where i € {1,2}. By Lemma 3, we know that r x > r Xl + r X2 > T x< t4^. This means that 

as 7 < 4 ; then then a sample will be pulled again in one of the two nodes {x\,X2~\ after the Exploration Phase. 
Assume without risk of generality that it is node x\ that is pulled. 

r x 2 < r x 1 

T ~ T — 1 ' 
Note also that I* 2 < Z" 2 . By summing, we get that 

7f^ ,n + T X2 n 1) < T-x\ ~\~ T X2 ■ 
We thus have 



If a sample is also collected from stratum X2, then the same result applies also for x±. Otherwise, it means that 



— — rj:2 > 4—, and as one sample is collected in xi, we have J* 1 < 4—, so we have in any case 



Tx\ - ~r 

The recursion continues in the same way for any child y of x such that WyCTj, > QHc^/A—yjj (otherwise it has no 
children). Indeed, the budget in the terminal nodes of the Exploration partition Af^ does satisfy this property. 

Lemma 11 Let x be a node ofAf n - Let Af x be the sub-partition of nodes in that cover the domain of x. One 
has on £: 



E 



(WyQ-y) 2 (W x a x ) 2 



Proof The result of the Lemma follows by induction. 

Let us consider a node x £ N ni and let M x be the sub-partition of nodes in that cover the domain of x. 

Let yi and ?/2 be two nodes of Af x that have the same father-node y. Assume without risk of generality that 
r yi — r V2 ■ 

Lemma 10 states that 

r 



T Vun > (Ty^ n 1). 

' Vi ' ' Vi 



As T yit n + Ty 2Jl = T Vjfl , we have by the previous Equation 



In the same way, we obtain 



and 



T < ^ (T 

' 2/1 ~ ' J/2 



V — (T y ,n - 1) < r WiB < — ^ — (T y , n + 1). (36) 



(T,,n ~ 1) < T y2 , n < - (Ty.n + 1). (37) 



r yi + r yi r yi + r i/2 



From that we deduce that if r yi < r y2 , then T yi _ n < T, 



If r^j = rj, 2 , this implies that \T y2tTl — T y3 n \ < lj and the last sample is pulled at random between the two strata. 
From that we deduce that <Tyi - — h ^ w j3 <Jv2 ^ < ^"' Ty - > , in the same way that in Lemma 1. 

Assume now that r yi < r V2 . Note now that on £, because of the definition of r, we have on £ 



r yi -j, w yi a yi 

r yi ' r V2 w y\ a y\ "t" w y2 (T y2 

By combining that with Equation 36, we get on £ 

w ^ lfT ^ (t +1XT 

w yi (T yi + w V2 a y2 
which leads to 



In the same way, as on £ 



we have 



T ~ (T +1) ' { ' 



r V2 w V2 a y2 
r yi ' r V2 w y\ a y\ + w y 2 <T y 



"V 11 m ^ w y\ G y\ + w y2 G V2 (39) 



Ty 2 ,n -0 



We deduce from Equations 38 and 39 that on £ 

w yi 11 yi ^ w y2 (J y2 
T ~ T 

From that, together with the fact that r Vl < r V2 and T yi>n < T y2i1l , we deduce because of variance properties 
that 

{w yi a yi ) 2 (w yi a y2 ) 2 < (w yi cr yi ) 2 2 {w yi a y2 ) 2 < {w y a y ) 2 
yi,n y2,n y, n y, n yi n 

and note that as y% and y 2 are terminal nodes of Tf, then ^? gyi - — h CT " 2 - correspond to the variance of 
the stratified estimate on these nodes. 

In the same way, by induction, for any child y of x that is in 7^f, we also have 

(wy(T y ) 2 > (w yi a yi ) 2 (w yi cr y2 ) 2 > x ^ (w x a x ) 2 

rp — rp T 1 _ ' J T ' 

J S," ± y\-n z£j ^ J i,« 

which is the desired result in the specific case where y = x. 
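
The variance comparison behind Lemma 11 can be checked on a small example. The Python sketch below assumes a parent stratum with $w_x\sigma_x = 1$ and $T = 100$ samples, two children whose $w_y\sigma_y$ sum to at most the parent's value, and an exactly proportional allocation (the lemma only guarantees proportionality up to one sample); all numbers and names are hypothetical.

```python
def stratified_variance(w_sigma, counts):
    """Sum over strata of (w_x sigma_x)^2 / T_x."""
    return sum(v ** 2 / t for v, t in zip(w_sigma, counts))

parent_w_sigma, T = 1.0, 100          # hypothetical parent stratum
children_w_sigma = [0.2, 0.6]         # sums to at most parent_w_sigma
total = sum(children_w_sigma)

# Allocation proportional to w_y * sigma_y inside the parent's budget.
counts = [T * v / total for v in children_w_sigma]

var_children = stratified_variance(children_w_sigma, counts)
var_parent = stratified_variance([parent_w_sigma], [T])
```

Under proportional allocation the children's variance equals $(\sum_y w_y\sigma_y)^2/T$, here $0.8^2/100$, which is below the parent's $(w_x\sigma_x)^2/T$.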
D.5. Regret of the algorithm 

All the nodes in Af% are sampled in a homogeneous way, so it is coherent to define the risk as 

T (w x a x ) 2 

w - 2^ —f • 



By Lemma 11, we have on $\xi$

_ \ - (w x a x ) 2 ^ \ - (w x a x ) 2 

Now by Lemma 9, we have 



7 1 — 7 1 



I \2 V2 2/3 

£„ < 2^ -r ^ — + 5S AA„ 2^ ^173 • 



Finally, because of Equation 27 



V 2 2/3 

+ BY,^ > ^3 < mm 

y&N n 



V 2 m 2/3 
n L — ', n 1 ' 13 



Then by using again that $\mathcal{N}_n$ is the empirical minimizer of the bound, i.e. Equation 27, and also by upper bounding
C, we obtain the final result.
E. Large deviation inequalities for independent sub-Gaussian random variables

We first state Bernstein's inequality for large deviations of independent random variables around their mean.

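Lemma 12 Let $(X_1, \ldots, X_n)$ be $n$ independent random variables of mean $(\mu_1, \ldots, \mu_n)$ and of variance
$(\sigma_1^2, \ldots, \sigma_n^2)$. Assume that there exists $b > 0$ such that for any $\lambda < \frac{1}{b}$, for any $i \le n$, it holds that
$\mathbb{E}\left[\exp(\lambda(X_i - \mu_i))\right] \le \exp\left(\frac{\lambda^2\sigma_i^2}{2(1-\lambda b)}\right)$. Then with probability $1 - \delta$

$$\left|\frac{1}{n}\sum_{i=1}^n X_i - \frac{1}{n}\sum_{i=1}^n \mu_i\right| \le \sqrt{\frac{2\left(\frac{1}{n}\sum_{i=1}^n \sigma_i^2\right)\log(2/\delta)}{n}} + \frac{b\log(2/\delta)}{n}.$$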
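The deviation radius of Lemma 12 is easy to evaluate and to compare with simulated deviations. The following is a minimal sketch, not part of the original analysis: it assumes samples drawn uniformly on $[0,1]$ (variance $1/12$), for which the moment condition holds with $b = 1/6$ by the classical Bernstein condition for variables bounded by $c = 1/2$ around their mean ($b = c/3$). The function names `bernstein_radius` and `violation_rate` are hypothetical.

```python
import math
import random


def bernstein_radius(variances, b, delta):
    """Radius sqrt(2 * mean_var * log(2/delta) / n) + b * log(2/delta) / n."""
    n = len(variances)
    mean_var = sum(variances) / n
    log_term = math.log(2.0 / delta)
    return math.sqrt(2.0 * mean_var * log_term / n) + b * log_term / n


def violation_rate(n=200, trials=2000, delta=0.05, seed=0):
    """Fraction of simulated runs whose empirical mean leaves the radius."""
    rng = random.Random(seed)
    variances = [1.0 / 12.0] * n   # variance of Uniform[0, 1]
    b = 1.0 / 6.0                  # assumption: b = c / 3 with c = 1/2
    radius = bernstein_radius(variances, b, delta)
    violations = 0
    for _ in range(trials):
        empirical_mean = sum(rng.random() for _ in range(n)) / n
        if abs(empirical_mean - 0.5) > radius:
            violations += 1
    return violations / trials, radius


if __name__ == "__main__":
    rate, radius = violation_rate()
    print(f"radius = {radius:.4f}, observed violation rate = {rate:.4f}")
```

The observed violation rate should stay below $\delta$, since the bound is a worst-case guarantee and is typically conservative.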



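Proof If the assumptions of Lemma 12 are satisfied, then

$$\begin{aligned}
\mathbb{P}\left(\sum_{i=1}^n X_i - \sum_{i=1}^n \mu_i \ge n\epsilon\right)
&= \mathbb{P}\left(\exp\left(\lambda\left(\sum_{i=1}^n X_i - \sum_{i=1}^n \mu_i\right)\right) \ge \exp(n\lambda\epsilon)\right) \\
&\le \mathbb{E}\left[\frac{\exp\left(\lambda\left(\sum_{i=1}^n X_i - \sum_{i=1}^n \mu_i\right)\right)}{\exp(n\lambda\epsilon)}\right] \\
&\le \prod_{i=1}^n \mathbb{E}\left[\frac{\exp(\lambda(X_i - \mu_i))}{\exp(\lambda\epsilon)}\right] \\
&\le \exp\left(\frac{\lambda^2\sum_{i=1}^n \sigma_i^2}{2(1-\lambda b)} - n\lambda\epsilon\right).
\end{aligned}$$

By setting $\lambda = \frac{n\epsilon}{\sum_{i=1}^n \sigma_i^2 + bn\epsilon}$ we obtain

$$\mathbb{P}\left(\sum_{i=1}^n X_i - \sum_{i=1}^n \mu_i \ge n\epsilon\right) \le \exp\left(-\frac{n^2\epsilon^2}{2\left(\sum_{i=1}^n \sigma_i^2 + bn\epsilon\right)}\right).$$

By a union bound we obtain

$$\mathbb{P}\left(\left|\sum_{i=1}^n X_i - \sum_{i=1}^n \mu_i\right| \ge n\epsilon\right) \le 2\exp\left(-\frac{n^2\epsilon^2}{2\left(\sum_{i=1}^n \sigma_i^2 + bn\epsilon\right)}\right).$$

This means that with probability $1 - \delta$,

$$\left|\frac{1}{n}\sum_{i=1}^n X_i - \frac{1}{n}\sum_{i=1}^n \mu_i\right| \le \sqrt{\frac{2\left(\frac{1}{n}\sum_{i=1}^n \sigma_i^2\right)\log(2/\delta)}{n}} + \frac{b\log(2/\delta)}{n}.$$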
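The effect of the choice of $\lambda$ in the Chernoff step of this proof can be checked by direct substitution. The worked computation below is a sketch; the shorthand $S = \sum_{i=1}^n \sigma_i^2$ is introduced here only for readability and is not notation from the text.

```latex
% Shorthand (assumption for this note only): S = \sum_{i=1}^n \sigma_i^2 > 0.
\begin{align*}
\lambda &= \frac{n\epsilon}{S + bn\epsilon}
  \quad\Longrightarrow\quad
  1 - \lambda b = \frac{S}{S + bn\epsilon}, \\
\frac{\lambda^2 S}{2(1-\lambda b)}
  &= \frac{n^2\epsilon^2}{(S + bn\epsilon)^2}\cdot\frac{S\,(S + bn\epsilon)}{2S}
   = \frac{n^2\epsilon^2}{2(S + bn\epsilon)}, \\
n\lambda\epsilon &= \frac{n^2\epsilon^2}{S + bn\epsilon}, \\
\frac{\lambda^2 S}{2(1-\lambda b)} - n\lambda\epsilon
  &= -\frac{n^2\epsilon^2}{2(S + bn\epsilon)}.
\end{align*}
```

Note that this $\lambda$ is admissible, since $\frac{n\epsilon}{S + bn\epsilon} < \frac{1}{b}$ whenever $S > 0$.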

We also state the following Lemma on large deviations for the variance of independent random variables. 

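Lemma 13 Let $(X_1, \ldots, X_n)$ be $n$ independent random variables of mean $(\mu_1, \ldots, \mu_n)$ and of variance
$(\sigma_1^2, \ldots, \sigma_n^2)$. Assume that there exists $b > 0$ such that for any $\lambda < \frac{1}{b}$, for any $i \le n$, it holds that

$$\mathbb{E}\left[\exp(\lambda(X_i - \mu_i))\right] \le \exp\left(\frac{\lambda^2\sigma_i^2}{2(1-\lambda b)}\right)$$

and also

$$\mathbb{E}\left[\exp\left(\lambda(X_i - \mu_i)^2 - \lambda\sigma_i^2\right)\right] \le \exp\left(\frac{\lambda^2\sigma_i^2}{2(1-\lambda b)}\right).$$

Let $V = \frac{1}{n}\sum_{i=1}^n \left(\mu_i - \frac{1}{n}\sum_{j=1}^n \mu_j\right)^2 + \frac{1}{n}\sum_{i=1}^n \sigma_i^2$ be the variance of a sample chosen uniformly at random among the $n$
distributions, and $\hat{V} = \frac{1}{n}\sum_{i=1}^n \left(X_i - \frac{1}{n}\sum_{j=1}^n X_j\right)^2$ the corresponding empirical variance. Then with probability $1 - \delta$,

$$\left|\sqrt{\hat{V}} - \sqrt{V}\right| \le 2\sqrt{\frac{(1 + 3b + 4V)\log(2/\delta)}{n}}. \qquad (40)$$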
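The quantities $V$ and $\hat{V}$ and the radius of Equation 40 can be computed on synthetic data as a sanity check. This is a hedged sketch only: the means on a regular grid, the uniform noise of width 1 and the value $b = 1/6$ are all assumptions made for the illustration (in particular, the second moment condition of Lemma 13 is simply assumed to hold for this $b$), and the function names are hypothetical.

```python
import math
import random


def mixture_variance(means, variances):
    """V: variance of a sample drawn uniformly among the n distributions."""
    n = len(means)
    mean_bar = sum(means) / n
    between = sum((m - mean_bar) ** 2 for m in means) / n
    within = sum(variances) / n
    return between + within


def empirical_variance(samples):
    """V-hat: (1/n) * sum_i (X_i - mean(X))^2, one sample per distribution."""
    n = len(samples)
    sample_bar = sum(samples) / n
    return sum((x - sample_bar) ** 2 for x in samples) / n


def std_radius(V, b, delta, n):
    """Right-hand side of Equation 40."""
    return 2.0 * math.sqrt((1.0 + 3.0 * b + 4.0 * V) * math.log(2.0 / delta) / n)


def worst_gap(n=500, trials=200, delta=0.05, seed=1):
    """Largest observed |sqrt(V-hat) - sqrt(V)| over the trials, and the radius."""
    rng = random.Random(seed)
    means = [i / n for i in range(n)]   # hypothetical heterogeneous means
    variances = [1.0 / 12.0] * n        # uniform noise of width 1
    b = 1.0 / 6.0                       # assumed constant for the moment conditions
    V = mixture_variance(means, variances)
    radius = std_radius(V, b, delta, n)
    worst = 0.0
    for _ in range(trials):
        samples = [m + rng.random() - 0.5 for m in means]
        gap = abs(math.sqrt(empirical_variance(samples)) - math.sqrt(V))
        worst = max(worst, gap)
    return worst, radius, V


if __name__ == "__main__":
    worst, radius, V = worst_gap()
    print(f"V = {V:.4f}, worst gap = {worst:.4f}, radius = {radius:.4f}")
```

On such well-behaved data the observed gap is usually far smaller than the radius, which reflects the conservative constants of the bound.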
Proof By decomposing the estimate of the empirical variance in bias and variance, we obtain with probability
$1 - \delta$

KE<*-;£«.>"-C;E*-i;2>>" 

i j i i 

=- - + 2 - E^ - w)- E^* - - E ^) 

n z — ' n — ' n — ' n z — ' 

i j i i 

= 1 ^ - ^ + I j> - 1 2 - (- e ^ - 1 E ^) 2 - 

■II L 4 r) & 4 ii & 4 J nrt £ 4 i i & 4 



We then have by the definition of $V$ that with probability $1 - \delta$



(41) 



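If the assumptions of Lemma 13 are satisfied, we have

$$\begin{aligned}
\mathbb{P}\left(\sum_{i=1}^n (X_i - \mu_i)^2 - \sum_{i=1}^n \sigma_i^2 \ge n\epsilon\right)
&= \mathbb{P}\left(\exp\left(\lambda\left(\sum_{i=1}^n (X_i - \mu_i)^2 - \sum_{i=1}^n \sigma_i^2\right)\right) \ge \exp(n\lambda\epsilon)\right) \\
&\le \mathbb{E}\left[\frac{\exp\left(\lambda\left(\sum_{i=1}^n (X_i - \mu_i)^2 - \sum_{i=1}^n \sigma_i^2\right)\right)}{\exp(n\lambda\epsilon)}\right] \\
&\le \prod_{i=1}^n \mathbb{E}\left[\frac{\exp\left(\lambda(X_i - \mu_i)^2 - \lambda\sigma_i^2\right)}{\exp(\lambda\epsilon)}\right] \\
&\le \exp\left(\frac{\lambda^2\sum_{i=1}^n \sigma_i^2}{2(1-\lambda b)} - n\lambda\epsilon\right).
\end{aligned}$$

If we take $\lambda = \frac{n\epsilon}{\sum_{i=1}^n \sigma_i^2 + nb\epsilon}$ we obtain

$$\mathbb{P}\left(\sum_{i=1}^n (X_i - \mu_i)^2 - \sum_{i=1}^n \sigma_i^2 \ge n\epsilon\right) \le \exp\left(-\frac{n^2\epsilon^2}{2\left(\sum_{i=1}^n \sigma_i^2 + nb\epsilon\right)}\right). \qquad (42)$$

By a union bound we get

$$\mathbb{P}\left(\left|\sum_{i=1}^n (X_i - \mu_i)^2 - \sum_{i=1}^n \sigma_i^2\right| \ge n\epsilon\right) \le 2\exp\left(-\frac{n^2\epsilon^2}{2\left(\sum_{i=1}^n \sigma_i^2 + nb\epsilon\right)}\right).$$

This means that with probability $1 - \delta$,

$$\left|\frac{1}{n}\sum_{i=1}^n (X_i - \mu_i)^2 - \frac{1}{n}\sum_{i=1}^n \sigma_i^2\right| \le \sqrt{\frac{2\left(\frac{1}{n}\sum_{i=1}^n \sigma_i^2\right)\log(2/\delta)}{n}} + \frac{b\log(2/\delta)}{n}. \qquad (43)$$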
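Finally, by combining Equations 41 and 43 with Lemma 12, we obtain with probability $1 - \delta$

$$\begin{aligned}
|\hat{V} - V| &\le \frac{4\left(\frac{1}{n}\sum_{i=1}^n \sigma_i^2\right)\log(2/\delta)}{n} + \frac{2b^2\log(2/\delta)^2}{n^2} + \sqrt{\frac{2\left(\frac{1}{n}\sum_{i=1}^n \sigma_i^2\right)\log(2/\delta)}{n}} + \frac{b\log(2/\delta)}{n} \\
&\le \sqrt{\frac{2\left(\frac{1}{n}\sum_{i=1}^n \sigma_i^2\right)\log(2/\delta)}{n}} + \frac{\left(3b + \frac{4}{n}\sum_{i=1}^n \sigma_i^2\right)\log(2/\delta)}{n} \\
&\le \sqrt{\frac{2V\log(2/\delta)}{n}} + \frac{(3b + 4V)\log(2/\delta)}{n},
\end{aligned}$$

when $n \ge b\log(2/\delta)$ and because $V \ge \frac{1}{n}\sum_{i=1}^n \sigma_i^2$.
This implies with probability $1 - \delta$ that

$$\hat{V} \le V + \sqrt{\frac{2V\log(2/\delta)}{n}} + \frac{\log(2/\delta)}{2n} + \frac{(3b + 4V)\log(2/\delta)}{n} + \frac{\log(2/\delta)}{2n}$$

$$\Rightarrow \sqrt{\hat{V}} \le \sqrt{V} + 2\sqrt{\frac{(1 + 3b + 4V)\log(2/\delta)}{n}}.$$

On the other hand, we have also with probability $1 - \delta$

$$\hat{V} \ge V - \sqrt{\frac{2V\log(2/\delta)}{n}} - \frac{(3b + 4V)\log(2/\delta)}{n}$$

$$\Rightarrow \sqrt{V} \le \sqrt{\hat{V}} + 2\sqrt{\frac{(1 + 3b + 4V)\log(2/\delta)}{n}}.$$

Finally, we have with probability $1 - \delta$

$$\left|\sqrt{\hat{V}} - \sqrt{V}\right| \le 2\sqrt{\frac{(1 + 3b + 4V)\log(2/\delta)}{n}}. \qquad (44)$$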



